1. Introduction

Markov chain Monte Carlo (MCMC) is a commonly used approach to evaluating
expectations of the form $\theta := \int_{\mathcal{X}} f(x)\,\pi(dx)$, where $\pi$ is an
intractable probability measure, e.g. known only up to a normalising constant.
One simulates $(X_n)_{n \ge 0}$, an ergodic Markov chain on $\mathcal{X}$, evolving according
to a transition kernel $P$ with stationary limiting distribution $\pi$ and,
typically, takes the ergodic average as an estimate of $\theta$. The approach is
justified by asymptotic Markov chain theory,
see e.g. [5^1135]. Metropolis algorithms and Gibbs samplers (to be described in 
Section [2]) are among the most common MCMC algorithms, c.f. [3TT [25l [38] . 

The quality of an estimate produced by an MCMC algorithm depends on 
probabilistic properties of the underlying Markov chain. Designing an appropri- 
ate transition kernel P that guarantees rapid convergence to stationarity and 
efficient simulation is often a challenging task, especially in high dimensions. 
For Metropolis algorithms there are various optimal scaling results [32l [36l [9l 
[TOl 21 [3TJ [38l |H] which provide "prescriptions" of how to do this, though they 
typically depend on unknown characteristics of $\pi$.
For random scan Gibbs samplers, a further design decision is choosing the
selection probabilities (i.e., coordinate weightings) which will be used to select 
which coordinate to update next. These are usually chosen to be uniform, but 
some recent work [26j [22l [23l [15l |43l [12] has suggested that non-uniform weight- 
ings may sometimes be preferable. 

For a very simple toy example to illustrate this issue, suppose
$\mathcal{X} = [0,1] \times [-100,100]$, with $\pi(x_1,x_2) \propto x_1^{100}(1+\sin(x_2))$. Then
with respect to $x_1$, this $\pi$ puts almost all of the mass right up against the
line $x_1 = 1$. Thus, repeated Gibbs sampler updates of the coordinate $x_1$ make
virtually no difference, and do not need to be done often at all (unless the
functional $f$ of interest is extremely sensitive to tiny changes in $x_1$). By
contrast, with respect to $x_2$, this $\pi$ is a highly multi-modal density with
wide support and many peaks and valleys, requiring many updates to the
coordinate $x_2$ in order to explore the state space appropriately. Thus, an
efficient Gibbs sampler would not update each of $x_1$ and $x_2$ equally often;
rather, it would update $x_2$ very often and $x_1$ hardly at all. Of course, in
this simple example, it is easy to see directly that $x_1$ should be updated less
than $x_2$, and furthermore such efficiencies would only improve the sampler by
approximately a factor of 2. However, in a high-dimensional example (c.f. [12]),
such issues could be much more significant and also much more difficult to
detect manually.
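The toy example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the target is taken to be $\pi(x_1,x_2) \propto x_1^{100}(1+\sin(x_2))$ on $[0,1] \times [-100,100]$, the function names are hypothetical, and the conditional for $x_2$ is drawn by simple rejection sampling. Since the two coordinates are independent under this target, each full conditional ignores the other coordinate.

```python
import math
import random


def draw_x1(rng):
    # Density proportional to x^100 on [0, 1] has CDF x^101,
    # so inverse-CDF sampling gives U^(1/101).
    return rng.random() ** (1.0 / 101.0)


def draw_x2(rng):
    # Rejection sampling for a density proportional to 1 + sin(x) on [-100, 100]:
    # propose uniformly, accept with probability (1 + sin(y)) / 2.
    while True:
        y = rng.uniform(-100.0, 100.0)
        if rng.random() < (1.0 + math.sin(y)) / 2.0:
            return y


def random_scan_gibbs(n_steps, alpha, seed=0):
    """Random scan Gibbs sampler with fixed selection probabilities alpha."""
    rng = random.Random(seed)
    x = [0.5, 0.0]      # arbitrary starting point inside the state space
    counts = [0, 0]     # how often each coordinate was updated
    path = []
    for _ in range(n_steps):
        i = 0 if rng.random() < alpha[0] else 1
        counts[i] += 1
        x[i] = draw_x1(rng) if i == 0 else draw_x2(rng)
        path.append(tuple(x))
    return path, counts


# Non-uniform weights: update x2 nine times as often as x1.
path, counts = random_scan_gibbs(2000, (0.1, 0.9), seed=1)
```

Choosing `alpha = (0.1, 0.9)` by hand is exactly the manual tuning that an adaptive scheme would try to automate.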

One promising avenue to address this challenge is adaptive MCMC algorithms.
As an MCMC simulation progresses, more and more information about the target
distribution $\pi$ is learned. Adaptive MCMC attempts to use this new
information to redesign the transition kernel $P$ on the fly, based on the
current simulation output. That is, the transition kernel $P_n$ used for
obtaining $X_n \,|\, X_{n-1}$ may depend on $\{X_0,\dots,X_{n-1}\}$. So, in the above toy
example, a good adaptive Gibbs sampler would somehow automatically "learn" to
update $x_1$ less often, without requiring the user to determine this manually
(which could be difficult or impossible in a very high-dimensional problem).

Unfortunately, such adaptive algorithms are only valid if their ergodicity can
be established. The stochastic process $(X_n)_{n \ge 0}$ for an adaptive algorithm is
no longer a Markov chain; the potential benefit of adaptive MCMC comes at the
price of requiring more sophisticated theoretical analysis. There is a
substantial and rapidly growing literature on both the theory and practice of
adaptive MCMC

(see e.g. [si nzi Ei m nsi Q2i ssi sol 12s bsi 1^1 us ibi isi izi h^i 121 si) which 

includes counterintuitive examples where $X_n$ fails to converge to the desired
distribution $\pi$ (c.f. [5, 39], [8], [21]), as well as many results guaranteeing
ergodicity under various assumptions. Most of the previous work on ergodicity
of adaptive MCMC has concentrated on adapting Metropolis and related
algorithms, with less attention paid to ergodicity when adapting the selection
probabilities for random scan Gibbs samplers.

Motivated by such considerations, in the present paper we study the ergodicity
of various types of adaptive Gibbs samplers. To our knowledge, proofs of
ergodicity for adaptively-weighted Gibbs samplers have previously been
considered only by [24], and we shall provide a counter-example below
(Example 3.1) to demonstrate that their main result is not correct. In view of
this, we are not aware of any valid ergodicity results in the literature that
consider adapting selection probabilities of random scan Gibbs samplers, and we
attempt to fill that gap herein.

This paper is organised as follows. We begin in Section 2 with basic
definitions. In Section 3 we present a cautionary Example 3.1, where a seemingly
ergodic adaptive Gibbs sampler is in fact transient (as we prove formally later
in Section 5) and provides a counter-example to Theorem 2.1 of [24]. Next, we
establish various positive results for ergodicity of adaptive Gibbs samplers.
In Section 4 we consider adaptive random scan Gibbs samplers (AdapRSG) which
update coordinate selection probabilities as the simulation progresses; in
Section 5 we consider adaptive random scan Metropolis-within-Gibbs samplers
(AdapRSMwG) which update coordinate selection probabilities as the simulation
progresses; and in Section 6 we consider adaptive random scan adaptive
Metropolis-within-Gibbs samplers (AdapRSadapMwG) that update coordinate
selection probabilities as well as proposal distributions for the Metropolis
steps, the case that corresponds most closely to the adaptations performed in
the statistical genetics work of [12]. In each case, we prove that under
reasonably mild conditions, the adaptive Gibbs samplers are guaranteed to be
ergodic, although our cautionary example does show that it is important to
verify some conditions before applying such algorithms. Finally, in Section 7
we consider particular methods of simultaneously adapting the selection
probabilities and proposal distributions, and prove that in addition to being
ergodic, such algorithms are approximately optimal under certain strong
assumptions.

2. Preliminaries 

Gibbs samplers are commonly used MCMC algorithms for sampling from complicated
high-dimensional probability distributions $\pi$ in cases where the full
conditional distributions of $\pi$ are easy to sample from. To define them, let
$(\mathcal{X}, \mathcal{B}(\mathcal{X}))$ be a $d$-dimensional state space where
$\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_d$ and write $X_n \in \mathcal{X}$ as
$X_n = (X_{n,1},\dots,X_{n,d})$. We shall use the shorthand notation

$$X_{n,-i} := (X_{n,1},\dots,X_{n,i-1},X_{n,i+1},\dots,X_{n,d}),$$

and similarly $\mathcal{X}_{-i} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_{i-1} \times \mathcal{X}_{i+1} \times \cdots \times \mathcal{X}_d$.

Let $\pi(\cdot|x_{-i})$ denote the conditional distribution of $Z_i \,|\, Z_{-i} = x_{-i}$ where
$Z \sim \pi$. The random scan Gibbs sampler draws $X_n$ given $X_{n-1}$ (iteratively
for $n = 1,2,3,\dots$) by first choosing one coordinate at random according to
some selection probabilities $\alpha = (\alpha_1,\dots,\alpha_d)$ (e.g. uniformly), and then
updating that coordinate by a draw from its conditional distribution. More
precisely, the Gibbs sampler transition kernel $P = P_\alpha$ is the result of
performing the following three steps.

Algorithm 2.1 (RSG($\alpha$)).

1. Choose coordinate $i \in \{1,\dots,d\}$ according to selection probabilities $\alpha$,
i.e. with $P(i = j) = \alpha_j$

2. Draw $Y \sim \pi(\cdot|X_{n-1,-i})$

3. Set $X_n := (X_{n-1,1},\dots,X_{n-1,i-1},Y,X_{n-1,i+1},\dots,X_{n-1,d})$.
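The three steps of the random scan Gibbs sampler translate almost line by line into code. The following Python sketch is illustrative only: the function names are hypothetical, and the bivariate normal target with correlation `rho` (whose full conditionals are $N(\rho x_{-i}, 1-\rho^2)$) is an assumption chosen because its conditionals are easy to draw from.

```python
import math
import random


def rsg_step(x, alpha, sample_conditional, rng):
    """One random scan Gibbs transition; returns (new_state, updated_index)."""
    # Step 1: choose coordinate i with P(i = j) = alpha[j].
    i = rng.choices(range(len(x)), weights=alpha, k=1)[0]
    # Step 2: draw Y from the full conditional of coordinate i.
    y = sample_conditional(i, x, rng)
    # Step 3: replace coordinate i and keep all other coordinates.
    return x[:i] + (y,) + x[i + 1:], i


def bivariate_normal_conditional(rho):
    """Full conditionals of a standard bivariate normal with correlation rho."""
    sd = math.sqrt(1.0 - rho * rho)

    def sample(i, x, rng):
        return rng.gauss(rho * x[1 - i], sd)

    return sample


rng = random.Random(0)
sampler = bivariate_normal_conditional(0.5)
state = (0.0, 0.0)
for _ in range(100):
    state, _ = rsg_step(state, (0.5, 0.5), sampler, rng)
```

States are stored as tuples so that each transition returns a new object and the previous state is never modified in place.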

Whereas the standard approach is to choose the coordinate $i$ at the first
step uniformly at random, which corresponds to $\alpha = (1/d,\dots,1/d)$, this may
be a substantial waste of simulation effort if $d$ is large and the variability
of the coordinates differs significantly. This has been discussed theoretically
in [26] and also observed empirically e.g. in Bayesian variable selection for
linear models
in statistical genetics [331 El- We consider a class of adaptive random scan 
Gibbs samplers where the selection probabilities $\alpha = (\alpha_1,\dots,\alpha_d)$ are
subject to optimization within some subset $\mathcal{Y} \subset [0,1]^d$ of possible
choices. Therefore a single step of our generic adaptive algorithm for drawing
$X_n$ given the trajectory $X_{n-1},\dots,X_0$ and current selection probabilities
$\alpha_{n-1} = (\alpha_{n-1,1},\dots,\alpha_{n-1,d})$ amounts to the following steps, where
$R_n(\cdot)$ is some update rule for $\alpha_n$.

Algorithm 2.2 (AdapRSG).

1. Set $\alpha_n := R_n(\alpha_{n-1}, X_{n-1},\dots,X_0) \in \mathcal{Y}$

2. Choose coordinate $i \in \{1,\dots,d\}$ according to selection probabilities $\alpha_n$

3. Draw $Y \sim \pi(\cdot|X_{n-1,-i})$

4. Set $X_n := (X_{n-1,1},\dots,X_{n-1,i-1},Y,X_{n-1,i+1},\dots,X_{n-1,d})$

Algorithm 2.2 defines $P_n$, the transition kernel used at time $n$, and $\alpha_n$
plays here the role of $\Gamma_n$ in the more general adaptive setting of e.g. [39],
[8]. Let $\pi_n = \pi_n(x_0,\alpha_0)$ denote the distribution of $X_n$ induced by
Algorithm 2.1 or 2.2 given starting values $x_0$ and $\alpha_0$, i.e. for
$B \in \mathcal{B}(\mathcal{X})$,

$$\pi_n(B) = \pi_n((x_0,\alpha_0),B) := P(X_n \in B \,|\, X_0 = x_0, \alpha_0). \qquad (1)$$

Clearly if one uses Algorithm 2.1 then $\alpha_0 = \alpha$ remains fixed and
$\pi_n(x_0,\alpha)(B) = P_\alpha^n(x_0,B)$. By $\|\nu - \mu\|_{TV}$ denote the total variation
distance between probability measures $\nu$ and $\mu$. Let

$$T(x_0,\alpha_0,n) := \|\pi_n(x_0,\alpha_0) - \pi\|_{TV}. \qquad (2)$$

We call the adaptive Algorithm 2.2 ergodic if $T(x_0,\alpha_0,n) \to 0$ for
$\pi$-almost every starting state $x_0$ and all $\alpha_0 \in \mathcal{Y}$.
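For a finite state space the quantity $T(x_0,\alpha,n)$ can be computed exactly for the non-adaptive sampler, which may help make the definition concrete. The sketch below is an assumption-laden illustration: the target on $\{0,1,2\}^2$ with mass proportional to $1 + ab$ is hypothetical, as are all function names, and the total variation distance is computed through the identity $\|\nu - \mu\|_{TV} = \frac{1}{2}\sum_s |\nu(s) - \mu(s)|$, valid on finite sets.

```python
def tv_distance(p, q):
    """Total variation distance between two distributions on one finite set."""
    return 0.5 * sum(abs(p[s] - q[s]) for s in p)


def rsg_push(dist, pi, alpha):
    """Push a distribution over 2-d states through one RSG(alpha) transition."""
    new = dict.fromkeys(pi, 0.0)
    for s, mass in dist.items():
        for i in (0, 1):
            # Updating coordinate i keeps the other coordinate fixed, so the
            # reachable states share the value of coordinate 1 - i with s.
            line = [t for t in pi if t[1 - i] == s[1 - i]]
            norm = sum(pi[t] for t in line)
            for t in line:
                new[t] += mass * alpha[i] * pi[t] / norm
    return new


def T(x0, alpha, n, pi):
    """TV distance between the law of X_n started at x0 and the target pi."""
    dist = dict.fromkeys(pi, 0.0)
    dist[x0] = 1.0
    for _ in range(n):
        dist = rsg_push(dist, pi, alpha)
    return tv_distance(dist, pi)


# Hypothetical target on {0,1,2}^2 with pi(a, b) proportional to 1 + a*b.
weights = {(a, b): 1.0 + a * b for a in range(3) for b in range(3)}
total = sum(weights.values())
pi = {s: w / total for s, w in weights.items()}

distances = [T((0, 0), (0.5, 0.5), n, pi) for n in (0, 1, 5, 20)]
```

At $n = 0$ the distance equals $1 - \pi(x_0)$, and it is non-increasing in $n$ for a fixed kernel; for an adaptive sampler no such monotonicity is available in general.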

We shall also consider random scan Metropolis-within-Gibbs samplers that,
instead of sampling from the full conditional at step (2) of Algorithm 2.1
(respectively at step (3) of Algorithm 2.2), perform a single Metropolis step.
More precisely, given $X_{n-1,-i}$ the $i$-th coordinate $X_{n-1,i}$ is updated by a
draw $Y$ from the proposal distribution $Q_{X_{n-1,-i}}(X_{n-1,i}, \cdot)$ with the usual
Metropolis acceptance probability for the marginal stationary distribution
$\pi(\cdot|X_{n-1,-i})$. Such Metropolis-within-Gibbs algorithms were originally
proposed by [25] and have been very widely used. Versions of this algorithm
which adapt the proposal distributions $Q_{X_{n-1,-i}}(X_{n-1,i}, \cdot)$ were considered
by e.g. [18], [40], but always with fixed (usually uniform) coordinate selection
probabilities. If instead the proposal distributions $Q_{X_{n-1,-i}}(X_{n-1,i}, \cdot)$
remain fixed, but the selection probabilities $\alpha_i$ are adapted on the fly, we
obtain the following algorithm (where $q_{x_{-i}}(x,y)$ is the density function for
$Q_{x_{-i}}(x, \cdot)$).
Algorithm 2.3 (AdapRSMwG).

1. Set $\alpha_n := R_n(\alpha_{n-1}, X_{n-1},\dots,X_0) \in \mathcal{Y}$

2. Choose coordinate $i \in \{1,\dots,d\}$ according to selection probabilities $\alpha_n$

3. Draw $Y \sim Q_{X_{n-1,-i}}(X_{n-1,i}, \cdot)$

4. With probability

$$\min\left(1, \frac{\pi(Y|X_{n-1,-i})\, q_{X_{n-1,-i}}(Y, X_{n-1,i})}{\pi(X_{n-1,i}|X_{n-1,-i})\, q_{X_{n-1,-i}}(X_{n-1,i}, Y)}\right) \qquad (3)$$

accept the proposal and set

$$X_n = (X_{n-1,1},\dots,X_{n-1,i-1},Y,X_{n-1,i+1},\dots,X_{n-1,d});$$

otherwise, reject the proposal and set $X_n = X_{n-1}$.
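A single Metropolis-within-Gibbs update of one coordinate (the draw and accept/reject steps) might be coded as below. This is a minimal sketch, not the authors' implementation: the function names and signatures are hypothetical, densities are passed on the log scale, and the bivariate normal conditional with a symmetric Gaussian random-walk proposal is an assumed example.

```python
import math
import random


def mwg_step(x, i, log_cond_density, propose, log_q, rng):
    """Metropolis update of coordinate i; returns (new_state, accepted)."""
    y = propose(x[i], rng)
    # Log of the Metropolis-Hastings ratio: target conditional times the
    # proposal density of the reverse move, over the same for the forward move.
    log_ratio = (log_cond_density(i, y, x) + log_q(y, x[i])
                 - log_cond_density(i, x[i], x) - log_q(x[i], y))
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return x[:i] + (y,) + x[i + 1:], True
    return x, False


# Assumed example: bivariate normal target with correlation rho.
rho = 0.5


def log_cond(i, value, x):
    # Unnormalised log density of coordinate i given the other coordinate.
    return -(value - rho * x[1 - i]) ** 2 / (2.0 * (1.0 - rho * rho))


def propose(current, rng):
    return current + rng.gauss(0.0, 1.0)    # symmetric random-walk proposal


def log_q(a, b):
    return 0.0    # symmetric proposal: the q factors cancel in the ratio


rng = random.Random(1)
state, n_accepted = (0.0, 0.0), 0
for _ in range(500):
    i = rng.choices((0, 1), weights=(0.5, 0.5), k=1)[0]
    state, accepted = mwg_step(state, i, log_cond, propose, log_q, rng)
    n_accepted += accepted
```

Working on the log scale avoids overflow when the density ratio is extreme; `min(0.0, log_ratio)` plays the role of the outer minimum in the acceptance probability.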

Ergodicity of AdapRSMwG is considered in Section 5 below. Of course, if the
proposal distribution $Q_{X_{n-1,-i}}(X_{n-1,i}, \cdot)$ is symmetric about $X_{n-1,i}$, then
the $q$ factors in the acceptance probability (3) cancel out, and (3) reduces to
the simpler probability $\min\left(1, \pi(Y|X_{n-1,-i})/\pi(X_{n-1,i}|X_{n-1,-i})\right)$.

We shall also consider versions of the algorithm in which the proposal
distributions $Q_{X_{n-1,-i}}(X_{n-1,i}, \cdot)$ are also chosen adaptively, from some family
$\{Q_{x_{-i},\gamma}\}_{\gamma \in \Gamma_i}$ with corresponding density functions $q_{x_{-i},\gamma}$, as in
e.g. the statistical genetics
application [331 Versions of such algorithms with fixed selection proba- 
bilities are considered by e.g. [18] and [40]. They require additional adapta- 
tion parameters 7 Wj j that are updated on the fly and are allowed to depend 
on the past trajectories. More precisely, if $\gamma_n = (\gamma_{n,1},\dots,\gamma_{n,d})$ and
$\mathcal{G}_n = \sigma\{X_0,\dots,X_n,\alpha_0,\dots,\alpha_n,\gamma_0,\dots,\gamma_n\}$, then the conditional
distribution of $\gamma_n$ given $\mathcal{G}_{n-1}$ can be specified by the particular
algorithm used, via a second update function $R'_n$. If we combine such proposal
distribution adaptions with coordinate selection probability adaptions, this
results in a doubly-adaptive algorithm, as follows.

Algorithm 2.4 (AdapRSadapMwG).

1. Set $\alpha_n := R_n(\alpha_{n-1}, X_{n-1},\dots,X_0, \gamma_{n-1},\dots,\gamma_0) \in \mathcal{Y}$

2. Set $\gamma_n := R'_n(\alpha_{n-1}, X_{n-1},\dots,X_0, \gamma_{n-1},\dots,\gamma_0) \in \Gamma_1 \times \cdots \times \Gamma_d$

3. Choose coordinate $i \in \{1,\dots,d\}$ according to selection probabilities
$\alpha_n$, i.e. with $P(i = j) = \alpha_{n,j}$

4. Draw $Y \sim Q_{X_{n-1,-i},\gamma_{n-1}}(X_{n-1,i}, \cdot)$

5. With probability

$$\min\left(1, \frac{\pi(Y|X_{n-1,-i})\, q_{X_{n-1,-i},\gamma_{n-1}}(Y, X_{n-1,i})}{\pi(X_{n-1,i}|X_{n-1,-i})\, q_{X_{n-1,-i},\gamma_{n-1}}(X_{n-1,i}, Y)}\right)$$

accept the proposal and set

$$X_n = (X_{n-1,1},\dots,X_{n-1,i-1},Y,X_{n-1,i+1},\dots,X_{n-1,d});$$

otherwise, reject the proposal and set $X_n = X_{n-1}$.

Ergodicity of AdapRSadapMwG is considered in Section 6 below.
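To give a feel for what a second update function $R'_n$ could look like, here is a hedged Python sketch of one common heuristic, not taken from this paper: the adaptation parameter of a coordinate is the log standard deviation of a random-walk proposal, nudged up after an acceptance and down after a rejection with a step size that shrinks in $n$. The function name, the target acceptance rate 0.44, the step exponent 0.6 and the truncation bound are all assumptions made for illustration.

```python
def update_log_scale(log_sd, accepted, n, target=0.44, bound=5.0):
    """Hypothetical R'_n for one coordinate: adapt the proposal log-scale."""
    # Step size shrinks with n, so successive proposals differ less and less.
    step = (n + 1) ** -0.6
    new = log_sd + step * ((1.0 if accepted else 0.0) - target)
    # Truncate to [-bound, bound] so the parameter stays in a compact set.
    return max(-bound, min(bound, new))


# Usage: feed in the accept/reject outcome of each Metropolis step.
log_sd = 0.0
for n, accepted in enumerate([True, True, False, True, False, False]):
    log_sd = update_log_scale(log_sd, accepted, n)
```

The shrinking step is meant to mimic the kind of diminishing adaptation required of the Metropolis kernels in Section 6; whether a particular rule actually satisfies that condition has to be checked separately.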
3. A counter-example

Adaptive algorithms destroy the Markovian nature of $(X_n)_{n \ge 0}$, and are thus
notoriously difficult to analyse theoretically. In particular, it is easy to be
tricked into thinking that a simple adaptive algorithm "must" be ergodic when
in fact it is not.

For example, Theorem 2.1 of [24] states that ergodicity of adaptive Gibbs
samplers follows from the following two simple conditions:

(i) $\alpha_n \to \alpha$ a.s. for some fixed $\alpha \in (0,1)^d$; and

(ii) The random scan Gibbs sampler with fixed selection probabilities $\alpha$
induces an ergodic Markov chain with stationary distribution $\pi$.

Unfortunately, this claim is false, i.e. (i) and (ii) alone do not guarantee
ergodicity, as the following example and proposition demonstrate. (It seems
that in the proof of Theorem 2.1 in [24], the same measure is used to represent
trajectories of the adaptive process and of a corresponding non-adaptive
process, which is not correct and thus leads to the error.)

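Example 3.1. Let $\mathbb{N} = \{1,2,\dots\}$, and let the state space
$\mathcal{X} = \{(i,j) \in \mathbb{N} \times \mathbb{N} : i = j \text{ or } i = j+1\}$, with target distribution given
by $\pi(i,j) \propto j^{-2}$. On $\mathcal{X}$, consider a class of adaptive random scan Gibbs
samplers for $\pi$, as defined by Algorithm 2.2, with update rule given by:

$$R_n\big(\alpha_{n-1}, X_{n-1} = (i,j)\big) = \begin{cases} \left(\frac{1}{2} + \frac{4}{a_n},\ \frac{1}{2} - \frac{4}{a_n}\right) & \text{if } i = j, \\ \left(\frac{1}{2} - \frac{4}{a_n},\ \frac{1}{2} + \frac{4}{a_n}\right) & \text{if } i = j+1, \end{cases} \qquad (4)$$

for some choice of the sequence $(a_n)_{n=0}^{\infty}$ satisfying $8 < a_n \nearrow \infty$.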
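One transition of this example can be sketched in Python as follows. Several things here are assumptions rather than facts from the text: the update rule is read as shifting weight $4/a_n$ towards the first coordinate when $i = j$ and towards the second when $i = j+1$, and the sequence `a_n = 9 + n` is only a placeholder satisfying $8 < a_n \nearrow \infty$, not the sequence used in the paper's proof, so no conclusion about transience should be drawn from running it.

```python
import random


def selection_weights(state, a_n):
    """Assumed reading of the update rule: shift 4/a_n between coordinates."""
    i, j = state
    shift = 4.0 / a_n
    if i == j:
        return (0.5 + shift, 0.5 - shift)
    return (0.5 - shift, 0.5 + shift)


def step(state, a_n, rng):
    """One adaptive Gibbs transition on the staircase state space."""
    i, j = state
    alpha = selection_weights(state, a_n)
    if rng.random() < alpha[0]:
        # Update x1 given x2 = j: (j, j) and (j + 1, j) both have mass j^-2.
        return (j, j) if rng.random() < 0.5 else (j + 1, j)
    # Update x2 given x1 = i: (i, i) has mass i^-2 and, when i >= 2,
    # (i, i - 1) has mass (i - 1)^-2.
    if i == 1:
        return (1, 1)
    w_same, w_down = i ** -2.0, (i - 1) ** -2.0
    return (i, i) if rng.random() < w_same / (w_same + w_down) else (i, i - 1)


rng = random.Random(0)
state = (1, 1)
trace = []
for n in range(1000):
    state = step(state, 9.0 + n, rng)    # placeholder sequence a_n = 9 + n
    trace.append(state)
```

The weights favour exactly the coordinate whose update can move the chain one step up the staircase, which is the mechanism behind the drift discussed below.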

Example 3.1 satisfies assumptions (i) and (ii) above. Indeed, (i) clearly holds
since $\alpha_n \to \alpha := (\frac{1}{2}, \frac{1}{2})$, and (ii) follows immediately from the standard Markov
chain properties of irreducibility and aperiodicity (c.f. [331135). However, if a n 
increases to $\infty$ slowly enough, then the example exhibits transient behaviour
and is not ergodic. More precisely, we shall prove the following:

Proposition 3.2. There exists a choice of the $(a_n)$ for which the process
$(X_n)_{n \ge 0}$ defined in Example 3.1 is not ergodic. Specifically, starting at
$X_0 = (1,1)$, we have $P(X_{n,1} \to \infty) > 0$, i.e. the process exhibits transient
behaviour with positive probability, so it does not converge in distribution to
any probability measure on $\mathcal{X}$. In particular, $\|\pi_n - \pi\|_{TV} \nrightarrow 0$.

Remark 3.3. In fact, we believe that in Proposition 3.2 $P(X_{n,1} \to \infty) = 1$,
though to reduce technicalities we only prove that $P(X_{n,1} \to \infty) > 0$, which is
sufficient to establish non-ergodicity.

A detailed proof of Proposition 3.2 is presented in Section 5. We also
simulated Example 3.1 on a computer (with the $(a_n)$ as defined in Section 5),
resulting in the following trace plot of $X_{n,1}$, which illustrates the transient
behaviour since $X_{n,1}$ increases quickly and steadily as a function of $n$:

[Figure: trace plot of $X_{n,1}$ against $n$.]
4. Ergodicity of adaptive random scan Gibbs samplers

We now present various positive results about ergodicity of adaptive Gibbs
samplers under various assumptions. Most of our results are specific to
uniformly ergodic chains. (Recall that a Markov chain with transition kernel $P$
is uniformly ergodic if there exist $M < \infty$ and $\rho < 1$ s.t.
$\|P^n(x,\cdot) - \pi(\cdot)\|_{TV} \le M\rho^n$ for every $x \in \mathcal{X}$; see e.g. [29, 38] for this and
other notions related to general state space Markov chains.) In some sense this
is a severe restriction, since most MCMC algorithms arising in statistical
applications are not uniformly ergodic. However, truncating the variables
involved at some (very large) value is usually sufficient to ensure uniform
ergodicity without affecting the statistical conclusions in any practical
sense, so this is not an insurmountable practical problem. We do plan to
separately consider adaptive Gibbs samplers in the non-uniformly ergodic case,
but that case appears to be considerably more technical so we do not pursue it
further here.

To continue, recall that RSG($\alpha$) stands for the random scan Gibbs sampler
with selection probabilities $\alpha$ as defined by Algorithm 2.1, and AdapRSG is the
adaptive version as defined by Algorithm 2.2. For notation, let
$\Delta_{d-1} := \{(p_1,\dots,p_d) \in \mathbb{R}^d : p_i \ge 0, \sum_{i=1}^d p_i = 1\}$ be the
$(d-1)$-dimensional probability simplex, and let

$$\mathcal{Y} := [\varepsilon, 1]^d \cap \Delta_{d-1} \qquad (5)$$

for some $0 < \varepsilon < 1/d$. We shall generally assume that all our selection
probabilities are in this set $\mathcal{Y}$, to avoid difficulties arising when one or
more of the selection probabilities approach zero, so that certain coordinates
are virtually never updated and thus get "stuck".

The main result of this section is the following. 

Theorem 4.1. Let the selection probabilities $\alpha_n \in \mathcal{Y}$ for all $n$, with $\mathcal{Y}$
as in (5). Assume that

(a) $|\alpha_n - \alpha_{n-1}| \to 0$ in probability for fixed starting values $x_0 \in \mathcal{X}$ and
$\alpha_0 \in \mathcal{Y}$.

(b) there exists $\beta \in \mathcal{Y}$ s.t. RSG($\beta$) is uniformly ergodic.

Then AdapRSG is ergodic, i.e.

$$T(x_0,\alpha_0,n) \to 0 \quad \text{as } n \to \infty. \qquad (6)$$

Moreover, if
(a') $\sup_{x_0,\alpha_0} |\alpha_n - \alpha_{n-1}| \to 0$ in probability,
then convergence of AdapRSG is also uniform over all $x_0, \alpha_0$, i.e.

$$\sup_{x_0,\alpha_0} T(x_0,\alpha_0,n) \to 0 \quad \text{as } n \to \infty. \qquad (7)$$

Remark 4.2. 1. Assumption (b) will typically be verified for
$\beta = (1/d,\dots,1/d)$; see also Proposition 4.7 below.

2. We expect that most adaptive random scan Gibbs samplers will be designed so
that $|\alpha_n - \alpha_{n-1}| \le a_n$ for every $n \ge 1$, $x_0 \in \mathcal{X}$, $\alpha_0 \in \mathcal{Y}$, and
$\omega \in \Omega$, for some deterministic sequence $a_n \to 0$ (which holds for e.g.
the adaptations considered in [H])- In such cases, (a') is automatically 
satisfied. 

3. The sequence $\alpha_n$ is not required to converge, and in particular the amount
of adaptation, i.e. $\sum_{n=1}^{\infty} |\alpha_n - \alpha_{n-1}|$, is allowed to be infinite.

4. In Example 3.1, condition (a') is satisfied but condition (b) is not.

5. If we modify Example 3.1 by truncating the state space to say
$\tilde{\mathcal{X}} = \mathcal{X} \cap (\{1,\dots,M\} \times \{1,\dots,M\})$ for some $1 < M < \infty$, then the
corresponding adaptive Gibbs sampler is ergodic, and (7) holds.
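As an illustration of item 2 of the remark, the following Python sketch shows one hypothetical update rule whose increments are bounded by a deterministic sequence tending to zero. Everything in it is an assumption made for the example: the rule moves $\alpha_{n-1}$ a fraction $1/(n+1)$ of the way towards a floored target weighting, and the target `w_target` (any point of the simplex, for instance normalised coordinate-wise variances) is supplied by the user. Because the set $[\varepsilon,1]^d \cap \Delta_{d-1}$ is convex, the convex combination never leaves it.

```python
def floor_weights(w, eps):
    """Map a point w of the simplex into [eps, 1]^d while keeping sum 1."""
    d = len(w)
    return [eps + (1.0 - d * eps) * wi for wi in w]


def update_alpha(alpha_prev, w_target, n, eps):
    """Hypothetical R_n: move a fraction 1/(n+1) towards the floored target."""
    gamma = 1.0 / (n + 1)
    target = floor_weights(w_target, eps)
    return [(1.0 - gamma) * a + gamma * t for a, t in zip(alpha_prev, target)]


eps = 0.05
alpha = [1.0 / 3.0] * 3
history = [alpha]
for n in range(1, 50):
    # Placeholder target; in practice it would be estimated from the output.
    alpha = update_alpha(alpha, [0.7, 0.2, 0.1], n, eps)
    history.append(alpha)
```

Each coordinate of both $\alpha_{n-1}$ and the target lies in $[\varepsilon, 1-(d-1)\varepsilon]$, so in the maximum norm $|\alpha_n - \alpha_{n-1}| \le 1/(n+1)$ regardless of the starting values, while the sum of the increments may still diverge.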

Before we proceed with the proof of Theorem 4.1, we need some preliminary
lemmas, which may be of independent interest.

Lemma 4.3. Let $\beta \in \mathcal{Y}$ with $\mathcal{Y}$ as in (5). If RSG($\beta$) is uniformly ergodic,
then also RSG($\alpha$) is uniformly ergodic for every $\alpha \in \mathcal{Y}$. Moreover there
exist $M < \infty$ and $\rho < 1$ s.t. $\sup_{x_0 \in \mathcal{X}, \alpha \in \mathcal{Y}} T(x_0,\alpha,n) \le M\rho^n \to 0$.

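Proof. Let $P_\beta$ be the transition kernel of RSG($\beta$). It is well known that for
uniformly ergodic Markov chains the whole state space $\mathcal{X}$ is small (c.f.
Theorem 5.2.1 and 5.2.4 in [29] with their $\psi = \pi$). Thus there exist $s > 0$, a
probability measure $\mu$ on $(\mathcal{X}, \mathcal{B}(\mathcal{X}))$ and a positive integer $m$, s.t. for
every $x \in \mathcal{X}$,

$$P_\beta^m(x,\cdot) \ge s\mu(\cdot). \qquad (8)$$

Fix $\alpha \in \mathcal{Y}$ and let

$$r := \min_i \frac{\alpha_i}{\beta_i}.$$

Since $\beta \in \mathcal{Y}$, we have $1 \ge r \ge \frac{\varepsilon}{1-(d-1)\varepsilon} > 0$ and $P_\alpha$ can be written as a
mixture of transition kernels of two random scan Gibbs samplers, namely

$$P_\alpha = rP_\beta + (1-r)P_q, \quad \text{where } q = \frac{\alpha - r\beta}{1-r}.$$

This combined with (8) implies

$$P_\alpha^m(x,\cdot) \ge r^m P_\beta^m(x,\cdot) \ge r^m s\mu(\cdot) \ge \left(\frac{\varepsilon}{1-(d-1)\varepsilon}\right)^m s\mu(\cdot) \quad \text{for every } x \in \mathcal{X}. \qquad (9)$$

By Theorem 8 of [38] condition (9) implies

$$\|P_\alpha^n(x,\cdot) - \pi(\cdot)\|_{TV} \le \left(1 - \left(\frac{\varepsilon}{1-(d-1)\varepsilon}\right)^m s\right)^{\lfloor n/m \rfloor} \quad \text{for all } x \in \mathcal{X}. \qquad (10)$$

Since the right hand side of (10) does not depend on $\alpha$, the claim follows. □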
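The mixture decomposition used in this proof is easy to check numerically. The Python sketch below is only a sanity check with hypothetical names and example weights: it computes $r = \min_i \alpha_i/\beta_i$ and $q = (\alpha - r\beta)/(1-r)$, and the assertions one would run confirm that $q$ is a probability vector, that $\alpha = r\beta + (1-r)q$, and that $r \ge \varepsilon/(1-(d-1)\varepsilon)$.

```python
def mixture_decomposition(alpha, beta):
    """Return (r, q) with alpha = r * beta + (1 - r) * q and q in the simplex."""
    r = min(a / b for a, b in zip(alpha, beta))
    if r >= 1.0:
        # Both vectors sum to one, so r = 1 only when alpha equals beta.
        return 1.0, list(alpha)
    q = [(a - r * b) / (1.0 - r) for a, b in zip(alpha, beta)]
    return r, q


# Example weights in [eps, 1]^3 with eps = 0.1 (chosen for illustration).
eps, d = 0.1, 3
alpha = [0.2, 0.3, 0.5]
beta = [1.0 / 3.0] * 3
r, q = mixture_decomposition(alpha, beta)
lower_bound = eps / (1.0 - (d - 1) * eps)
```

Note that $q$ may have zero entries (here its first coordinate vanishes up to rounding), so $P_q$ is a random scan Gibbs sampler that may never update some coordinate; this is harmless because only the $rP_\beta$ part of the mixture is used in the minorisation.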

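Lemma 4.4. Let $P_\alpha$ and $P_{\alpha'}$ be random scan Gibbs samplers using selection
probabilities $\alpha, \alpha' \in \mathcal{Y} := [\varepsilon, 1-(d-1)\varepsilon]^d$ for some $\varepsilon > 0$. Then

$$\|P_\alpha(x,\cdot) - P_{\alpha'}(x,\cdot)\|_{TV} \le \frac{|\alpha - \alpha'|}{\varepsilon + |\alpha - \alpha'|} \le \frac{|\alpha - \alpha'|}{\varepsilon}. \qquad (11)$$

Proof. Let $\delta := |\alpha - \alpha'|$. Then

$$r := \min_i \frac{\alpha'_i}{\alpha_i} \ge \frac{\varepsilon}{\varepsilon + \max_i |\alpha_i - \alpha'_i|} \ge \frac{\varepsilon}{\varepsilon + \delta}$$

and reasoning as in the proof of Lemma 4.3 we can write
$P_{\alpha'} = rP_\alpha + (1-r)P_q$ for some $q$ and compute

$$\|P_\alpha(x,\cdot) - P_{\alpha'}(x,\cdot)\|_{TV} = \|(rP_\alpha + (1-r)P_\alpha) - (rP_\alpha + (1-r)P_q)\|_{TV} = (1-r)\|P_\alpha - P_q\|_{TV} \le \frac{\delta}{\varepsilon + \delta},$$

as claimed. □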

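Corollary 4.5. $P_\alpha(x,B)$ as a function of $\alpha$ on $\mathcal{Y}$ is Lipschitz with
Lipschitz constant $1/\varepsilon$ for every fixed set $B \in \mathcal{B}(\mathcal{X})$.

Corollary 4.6. If $|\alpha_n - \alpha_{n-1}| \to 0$ in probability, then also
$\sup_{x \in \mathcal{X}} \|P_{\alpha_n}(x,\cdot) - P_{\alpha_{n-1}}(x,\cdot)\|_{TV} \to 0$ in probability.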



K. Latuszynski and J.S.Rosenthal/Adaptive Gibbs samplers 






Proof of Theorem 4.1. We conclude the result from Theorem 1 of [39], which
requires simultaneous uniform ergodicity and diminishing adaptation.
Simultaneous uniform ergodicity results from combining assumption (b) and
Lemma 4.3. Diminishing adaptation results from assumption (a) with
Corollary 4.6. Moreover note that Lemma 4.3 is uniform in $x_0$ and $\alpha_0$, and
(a') yields uniformly diminishing adaptation, again by Corollary 4.6. A look
into the proof of Theorem 1 of [39] reveals that this suffices for the uniform
part of Theorem 4.1. □

Finally, we note that verifying uniform ergodicity of a random scan Gibbs
sampler, as required by assumption (b) of Theorem 4.1, may not be
straightforward. Such issues have been investigated in e.g. [33] and more
recently in relation to the parametrization of hierarchical models (see [30]
and references therein). In the following proposition, we show that to verify
uniform ergodicity of any random scan Gibbs sampler, it suffices to verify
uniform ergodicity of the corresponding systematic scan Gibbs sampler (which
updates the coordinates $1, 2, \dots, d$ in sequence rather than selecting
coordinates randomly).

Proposition 4.7. Let $\alpha \in \mathcal{Y}$ with $\mathcal{Y}$ as in (5). If the systematic scan
Gibbs sampler is uniformly ergodic, then so is RSG($\alpha$).

Proof. Let

$$P = P_1 P_2 \cdots P_d$$

be the transition kernel of the uniformly ergodic systematic scan Gibbs
sampler, where $P_i$ stands for the step that updates coordinate $i$. By the
minorisation condition characterisation, there exist $s > 0$, a probability
measure $\mu$ on $(\mathcal{X}, \mathcal{B}(\mathcal{X}))$ and a positive integer $m$, s.t. for every
$x \in \mathcal{X}$,

$$P^m(x,\cdot) \ge s\mu(\cdot).$$

However, the probability that the random scan Gibbs sampler $P_{1/d}$ in its $md$
subsequent steps will update the coordinates in exactly the same order is
$(1/d)^{md} > 0$. Therefore the following minorisation condition holds for the
random scan Gibbs sampler:

$$P_{1/d}^{md}(x,\cdot) \ge (1/d)^{md} s\mu(\cdot).$$

We conclude that RSG($1/d$) is uniformly ergodic, and then by Lemma 4.3 it
follows that RSG($\alpha$) is uniformly ergodic for any $\alpha \in \mathcal{Y}$. □

5. Adaptive random scan Metropolis-within-Gibbs 

In this section we consider random scan Metropolis-within-Gibbs sampler
algorithms. Thus, given $X_{n-1,-i}$, the $i$-th coordinate $X_{n-1,i}$ is updated by a
draw $Y$ from the proposal distribution $Q_{X_{n-1,-i}}(X_{n-1,i}, \cdot)$ with the usual
Metropolis acceptance probability for the marginal stationary distribution
$\pi(\cdot|X_{n-1,-i})$. Here, we consider Algorithm AdapRSMwG, where the proposal
distributions $Q_{X_{n-1,-i}}(X_{n-1,i}, \cdot)$ remain fixed, but the selection probabilities
$\alpha_i$ are adapted on the fly. We shall prove ergodicity of such algorithms under
some circumstances. (The more general algorithm AdapRSadapMwG is then
considered in the following section.)

To continue, let $P_{x_{-i}}$ denote the resulting Metropolis transition kernel for
obtaining $X_{n,i} \,|\, X_{n-1,i}$ given $X_{n-1,-i} = x_{-i}$. We shall require the following
assumption.

Assumption 5.1. For every $i \in \{1,\dots,d\}$ the transition kernel $P_{x_{-i}}$ is
uniformly ergodic for every $x_{-i} \in \mathcal{X}_{-i}$. Moreover there exist $s_1 > 0$ and an
integer $m_1$ s.t. for every $x_{-i} \in \mathcal{X}_{-i}$ there exists a probability measure
$\nu_{x_{-i}}$ on $(\mathcal{X}_i, \mathcal{B}(\mathcal{X}_i))$, s.t.

$$P_{x_{-i}}^{m_1}(x_i,\cdot) \ge s_1\nu_{x_{-i}}(\cdot) \quad \text{for every } x_i \in \mathcal{X}_i.$$

We have the following counterpart of Theorem 4.1.
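Theorem 5.2. Let $\alpha_n \in \mathcal{Y}$ for all $n$, with $\mathcal{Y}$ as in (5). Assume that

(a) $|\alpha_n - \alpha_{n-1}| \to 0$ in probability for fixed starting values $x_0 \in \mathcal{X}$ and
$\alpha_0 \in \mathcal{Y}$.

(b) there exists $\beta \in \mathcal{Y}$ s.t. RSG($\beta$) is uniformly ergodic.

(c) Assumption 5.1 holds.

Then AdapRSMwG is ergodic, i.e.

$$T(x_0,\alpha_0,n) \to 0 \quad \text{as } n \to \infty. \qquad (12)$$

Moreover, if
(a') $\sup_{x_0,\alpha_0} |\alpha_n - \alpha_{n-1}| \to 0$ in probability,
then convergence of AdapRSMwG is also uniform over all $x_0, \alpha_0$, i.e.

$$\sup_{x_0,\alpha_0} T(x_0,\alpha_0,n) \to 0 \quad \text{as } n \to \infty. \qquad (13)$$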

Remark 5.3. Remarks 4.2.1-4.2.3 still apply. Also, Assumption 5.1 can easily
be verified in some cases of interest, e.g.

1. Independence samplers are essentially uniformly ergodic if and only if the
candidate density is bounded below by a multiple of the stationary density,
i.e. $q(dx) \ge s\pi(dx)$ for some $s > 0$, c.f. [27].

2. The Metropolis-Hastings algorithm with continuous and positive proposal
density $q(\cdot,\cdot)$ and bounded target density $\pi$ is uniformly ergodic if the
state space is compact, c.f. [29, 38].

To prove Theorem 5.2 we build on the approach of [35]. In particular recall
the following notion of strong uniform ergodicity.

Definition 5.4. We say that a transition kernel $P$ on $\mathcal{X}$ with stationary
distribution $\pi$ is $(m,s)$-strongly uniformly ergodic, if for some $s > 0$ and
positive integer $m$

$$P^m(x,\cdot) \ge s\pi(\cdot) \quad \text{for every } x \in \mathcal{X}.$$

Moreover, we will say that a family of Markov chains $\{P_\gamma\}_{\gamma \in \Gamma}$ on $\mathcal{X}$ with
stationary distribution $\pi$ is $(m,s)$-simultaneously strongly uniformly ergodic,
if for some $s > 0$ and positive integer $m$

$$P_\gamma^m(x,\cdot) \ge s\pi(\cdot) \quad \text{for every } x \in \mathcal{X} \text{ and } \gamma \in \Gamma.$$
By Proposition 1 in [35], if a Markov chain is both uniformly ergodic and
reversible, then it is strongly uniformly ergodic. The following lemma improves
over this result by controlling both involved parameters.

Lemma 5.5. Let $\mu$ be a probability measure on $\mathcal{X}$, let $m$ be a positive
integer and let $s > 0$. If a reversible transition kernel $P$ satisfies the
condition

$$P^m(x,\cdot) \ge s\mu(\cdot) \quad \text{for every } x \in \mathcal{X},$$

then it is $\left(\left(\left\lfloor \frac{\log(s/4)}{\log(1-s)} \right\rfloor + 2\right)m, \frac{s^2}{8}\right)$-strongly uniformly ergodic.

Proof. By Theorem 8 of [38] for every $A \in \mathcal{B}(\mathcal{X})$ we have

$$|P^n(x,A) - \pi(A)| \le (1-s)^{\lfloor n/m \rfloor}.$$

And in particular

$$|P^{km}(x,A) - \pi(A)| \le s/4 \quad \text{for } k \ge \frac{\log(s/4)}{\log(1-s)}. \qquad (14)$$

Since $\pi$ is stationary for $P$, we have $\pi(\cdot) \ge s\mu(\cdot)$ and thus an upper bound
for the Radon-Nikodym derivative

$$d\mu/d\pi \le 1/s. \qquad (15)$$

Moreover by reversibility

$$\pi(dx)P^m(x,dy) = \pi(dy)P^m(y,dx) \ge \pi(dy)s\mu(dx)$$

and consequently

$$P^m(x,dy) \ge s\big(\mu(dx)/\pi(dx)\big)\pi(dy). \qquad (16)$$

Now define

$$A := \{x \in \mathcal{X} : \mu(dx)/\pi(dx) \ge 1/2\}.$$

Clearly $\mu(A^c) \le 1/2$. Therefore by (15) we have

$$1/2 \le \mu(A) \le (1/s)\pi(A)$$

and hence $\pi(A) \ge s/2$. Moreover (14) yields

$$P^{km}(x,A) \ge s/4 \quad \text{for } k := \left\lfloor \frac{\log(s/4)}{\log(1-s)} \right\rfloor + 1.$$

And with $k$ defined above, by (16) we have

$$P^{km+m}(x,\cdot) = \int_{\mathcal{X}} P^{km}(x,dz)P^m(z,\cdot) \ge \int_A P^{km}(x,dz)P^m(z,\cdot) \ge \int_A P^{km}(x,dz)(s/2)\pi(\cdot) \ge (s^2/8)\pi(\cdot).$$

This completes the proof. □
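The constants in Lemma 5.5 are explicit, so they can be evaluated for given minorisation parameters. The short Python sketch below assumes the reading $\left(\left(\lfloor \log(s/4)/\log(1-s) \rfloor + 2\right)m,\ s^2/8\right)$ of the lemma's conclusion; the function name is hypothetical and the restriction to $0 < s < 1$ is imposed only so that both logarithms are defined.

```python
import math


def strong_ergodicity_parameters(m, s):
    """Return (m_new, s_new) such that P^m_new(x, .) >= s_new * pi(.)."""
    if not 0.0 < s < 1.0:
        raise ValueError("this sketch requires 0 < s < 1")
    # Smallest integer k used in the proof: (1 - s)^k <= s / 4.
    k = math.floor(math.log(s / 4.0) / math.log(1.0 - s)) + 1
    return (k + 1) * m, s * s / 8.0


# Example: a 3-step minorisation with constant s = 0.2.
m_new, s_new = strong_ergodicity_parameters(3, 0.2)
```

For $m = 3$ and $s = 0.2$ the ratio of logarithms is about 13.4, so $k = 14$ and the lemma gives a 45-step minorisation by $\pi$ itself with constant $0.005$; the price of replacing $\mu$ by $\pi$ is a longer block and a smaller constant.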
We will need the following generalization of Lemma 4.3.

Lemma 5.6. Let $\beta \in \mathcal{Y}$ with $\mathcal{Y}$ as in (5). If RSG($\beta$) is uniformly ergodic
then there exist $s' > 0$ and a positive integer $m'$ s.t. the family
$\{\text{RSG}(\alpha)\}_{\alpha \in \mathcal{Y}}$ is $(m', s')$-simultaneously strongly uniformly ergodic.

Proof. $P_\beta(x,\cdot)$ is uniformly ergodic and reversible, therefore by
Proposition 1 in [35] it is $(m, s_1)$-strongly uniformly ergodic for some $m$ and
$s_1$. Therefore, and arguing as in the proof of Lemma 4.3, c.f. (9), there exists
$s_2 \ge \left(\frac{\varepsilon}{1-(d-1)\varepsilon}\right)^m$ s.t. for every $\alpha \in \mathcal{Y}$ and every $x \in \mathcal{X}$

$$P_\alpha^m(x,\cdot) \ge s_2 P_\beta^m(x,\cdot) \ge s_1 s_2 \pi(\cdot). \qquad (17)$$

Set $m' = m$ and $s' = s_1 s_2$. □



Proof of Theorem 5.2. We proceed as in the proof of Theorem 4.1, i.e. establish
diminishing adaptation and simultaneous uniform ergodicity and conclude (12)
and (13) from Theorem 1 of [39]. Observe that Lemma 4.4 applies for random scan
Metropolis-within-Gibbs algorithms in exactly the same way as for random scan
Gibbs samplers. Thus diminishing adaptation results from assumption (a) and
Corollary 4.6. To establish simultaneous uniform ergodicity, observe that by
Assumption 5.1 and Lemma 5.5 the Metropolis transition kernel for the $i$-th
coordinate, i.e. $P_{x_{-i}}$, has stationary distribution $\pi(\cdot|x_{-i})$ and is
$\left(\left(\left\lfloor \frac{\log(s_1/4)}{\log(1-s_1)} \right\rfloor + 2\right)m_1, \frac{s_1^2}{8}\right)$-strongly uniformly ergodic. Moreover by Lemma 5.6
the family RSG($\alpha$), $\alpha \in \mathcal{Y}$, is $(m', s')$-strongly uniformly ergodic, therefore
by Theorem 2 of [35] the family of random scan Metropolis-within-Gibbs
samplers with selection probabilities $\alpha \in \mathcal{Y}$, RSMwG($\alpha$), is
$(m_*, s_*)$-simultaneously strongly uniformly ergodic with $m_*$ and $s_*$ given as
in [35]. □



We close this section with the following alternative version of Theorem 5.2.

Theorem 5.7. Let $\alpha_n \in \mathcal{Y}$ for all $n$, with $\mathcal{Y}$ as in (5). Assume that

(a) $|\alpha_n - \alpha_{n-1}| \to 0$ in probability for fixed starting values $x_0 \in \mathcal{X}$ and
$\alpha_0 \in \mathcal{Y}$.

(b) there exists $\beta \in \mathcal{Y}$ s.t. RSMwG($\beta$) is uniformly ergodic.

Then AdapRSMwG is ergodic, i.e.

$$T(x_0,\alpha_0,n) \to 0 \quad \text{as } n \to \infty. \qquad (18)$$

Moreover, if
(a') $\sup_{x_0,\alpha_0} |\alpha_n - \alpha_{n-1}| \to 0$ in probability,
then convergence of AdapRSMwG is also uniform over all $x_0, \alpha_0$, i.e.

$$\sup_{x_0,\alpha_0} T(x_0,\alpha_0,n) \to 0 \quad \text{as } n \to \infty. \qquad (19)$$

Proof. Diminishing adaptation results from assumption (a) and Corollary 4.6.
Simultaneous uniform ergodicity can be established as in the proof of
Lemma 4.3. The claim follows from Theorem 1 of [39]. □
Remark 5.8. Whereas the statement of Theorem 5.7 may be useful in specific
examples, typically condition (b), the uniform ergodicity of a random scan
Metropolis-within-Gibbs sampler, will not be available, and establishing it
will involve the conditions required by Theorem 5.2.

6. Adaptive random scan adaptive Metropolis-within-Gibbs 

In this section, we consider the adaptive random scan adaptive
Metropolis-within-Gibbs algorithm AdapRSadapMwG, which updates both the
selection probabilities of the Gibbs kernel and the proposal distributions of
the Metropolis step. Thus, given $X_{n-1,-i}$, the $i$-th coordinate $X_{n-1,i}$ is
updated by a draw $Y$ from a proposal distribution $Q_{X_{n-1,-i},\gamma_{n,i}}(X_{n-1,i}, \cdot)$ with
the usual acceptance probability. This doubly-adaptive algorithm has been used
by e.g. [12] for an application in statistical genetics. As with adaptive
Metropolis algorithms, the adaption of the proposal distributions in this
setting is motivated by optimal scaling results for
random walk Metropolis algorithms [321 ESI El HDl SI ISl EHl SI] - 

Let $P_{x_{-i},\gamma_{n,i}}$ denote the resulting Metropolis transition kernel for obtaining
$X_{n,i} \,|\, X_{n-1,i}$ given $X_{n-1,-i} = x_{-i}$. We will prove ergodicity of this
generalised algorithm using tools from the previous section. Assumption 5.1
must be reformulated accordingly, as follows.

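Assumption 6.1. For every $i \in \{1,\dots,d\}$, $x_{-i} \in \mathcal{X}_{-i}$ and $\gamma_i \in \Gamma_i$, the
transition kernel $P_{x_{-i},\gamma_i}$ is uniformly ergodic. Moreover there exist $s_1 > 0$
and an integer $m_1$ s.t. for every $x_{-i} \in \mathcal{X}_{-i}$ and $\gamma_i \in \Gamma_i$ there exists a
probability measure $\nu_{x_{-i},\gamma_i}$ on $(\mathcal{X}_i, \mathcal{B}(\mathcal{X}_i))$, s.t.

$$P_{x_{-i},\gamma_i}^{m_1}(x_i,\cdot) \ge s_1\nu_{x_{-i},\gamma_i}(\cdot) \quad \text{for every } x_i \in \mathcal{X}_i.$$

We have the following counterpart of Theorems 4.1 and 5.2.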
Theorem 6.2. Let $\alpha_n \in \mathcal{Y}$ for all $n$, with $\mathcal{Y}$ as defined above. Assume that

(a) $|\alpha_n - \alpha_{n-1}| \to 0$ in probability for fixed starting values $x_0 \in \mathcal{X}$ and $\alpha_0 \in \mathcal{Y}$.

(b) there exists $\beta \in \mathcal{Y}$ s.t. RSG($\beta$) is uniformly ergodic.

(c) Assumption 6.1 holds.

(d) The Metropolis-within-Gibbs kernels exhibit diminishing adaptation, i.e.
for every $i \in \{1, \dots, d\}$ the $\mathcal{G}_{n+1}$-measurable random variable

\[ \sup_{x \in \mathcal{X}} \| P_{x_{-i},\gamma_{n+1,i}}(x_i, \cdot) - P_{x_{-i},\gamma_{n,i}}(x_i, \cdot) \|_{TV} \to 0 \quad \text{in probability, as } n \to \infty, \]

for fixed starting values $x_0 \in \mathcal{X}$ and $\alpha_0 \in \mathcal{Y}$.

Then AdapRSadapMwG is ergodic, i.e.

\[ T(x_0, \alpha_0, n) \to 0 \quad \text{as } n \to \infty. \tag{20} \]

Moreover, if

(a') $\sup_{x_0,\alpha_0} |\alpha_n - \alpha_{n-1}| \to 0$ in probability,

(d') $\sup_{x_0,\alpha_0} \sup_{x \in \mathcal{X}} \| P_{x_{-i},\gamma_{n+1,i}}(x_i, \cdot) - P_{x_{-i},\gamma_{n,i}}(x_i, \cdot) \|_{TV} \to 0$ in probability,

then convergence of AdapRSadapMwG is also uniform over all $x_0, \alpha_0$, i.e.

\[ \sup_{x_0, \alpha_0} T(x_0, \alpha_0, n) \to 0 \quad \text{as } n \to \infty. \tag{21} \]

Remark 6.3. Remarks 4.2.1-4.2.3 still apply. Moreover, Remark 5.3 applies for
verifying Assumption 6.1. Verifying condition (d) is discussed after the proof.

Proof. We again proceed by establishing diminishing adaptation and simultaneous
uniform ergodicity and concluding the result from Theorem 1 of [39].
To establish simultaneous uniform ergodicity we proceed as in the proof of
Theorem 5.2. Observe that by Assumption 6.1 and Lemma 5.5 every adaptive
Metropolis transition kernel for the $i$th coordinate, i.e. $P_{x_{-i},\gamma_i}$, has stationary
distribution $\pi(\cdot \mid x_{-i})$ and is strongly uniformly ergodic, with constants
determined by $m_i$ and $s_i$ as in Lemma 5.5. Moreover, by Lemma 5.6 the family
RSG($\alpha$), $\alpha \in \mathcal{Y}$, is $(m', s')$-strongly uniformly ergodic, therefore by
Theorem 2 of [35] the family of random scan Metropolis-within-Gibbs samplers
with selection probabilities $\alpha \in \mathcal{Y}$ and proposals indexed by $\gamma \in \Gamma$ is
$(m_*, s_*)$-simultaneously strongly uniformly ergodic with $m_*$ and $s_*$ given as in [35].

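For diminishing adaptation we write

\[ \sup_{x \in \mathcal{X}} \| P_{\alpha_n,\gamma_n}(x, \cdot) - P_{\alpha_{n-1},\gamma_{n-1}}(x, \cdot) \|_{TV}
   \le \sup_{x \in \mathcal{X}} \| P_{\alpha_n,\gamma_n}(x, \cdot) - P_{\alpha_{n-1},\gamma_n}(x, \cdot) \|_{TV}
   + \sup_{x \in \mathcal{X}} \| P_{\alpha_{n-1},\gamma_n}(x, \cdot) - P_{\alpha_{n-1},\gamma_{n-1}}(x, \cdot) \|_{TV}. \]

The first term above converges to 0 in probability by Corollary 4.6 and
assumption (a). The second term

\[ \sup_{x \in \mathcal{X}} \| P_{\alpha_{n-1},\gamma_n}(x, \cdot) - P_{\alpha_{n-1},\gamma_{n-1}}(x, \cdot) \|_{TV}
   \le \sum_{i=1}^{d} \alpha_{n-1,i} \sup_{x \in \mathcal{X}} \| P_{x_{-i},\gamma_{n,i}}(x_i, \cdot) - P_{x_{-i},\gamma_{n-1,i}}(x_i, \cdot) \|_{TV} \]

converges to 0 in probability as a mixture of terms that converge to 0 in
probability. □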



The following lemma can be used to verify assumption (d) of Theorem 6.2;
see also Example 6.5 below.



Lemma 6.4. Assume that the adaptive proposals exhibit diminishing adaptation,
i.e. for every $i \in \{1, \dots, d\}$ the $\mathcal{G}_{n+1}$-measurable random variable

\[ \sup_{x \in \mathcal{X}} \| Q_{x_{-i},\gamma_{n+1,i}}(x_i, \cdot) - Q_{x_{-i},\gamma_{n,i}}(x_i, \cdot) \|_{TV} \to 0 \quad \text{in probability, as } n \to \infty, \]

for fixed starting values $x_0 \in \mathcal{X}$ and $\alpha_0 \in \mathcal{Y}$.
Then any of the following conditions

(i) The Metropolis proposals have symmetric densities, i.e.

\[ q_{x_{-i},\gamma_{n,i}}(x_i, y_i) = q_{x_{-i},\gamma_{n,i}}(y_i, x_i), \]

(ii) $\mathcal{X}_i$ is compact for every $i$, and $\pi$ is continuous, everywhere positive and bounded,

implies condition (d) of Theorem 6.2.

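Proof. Let $P_1, P_2$ denote transition kernels and $Q_1, Q_2$ proposal kernels of two
generic Metropolis algorithms for sampling from $\pi$ on an arbitrary state space $\mathcal{X}$.
To see that (i) implies (d) we check that

\[ \| P_1(x, \cdot) - P_2(x, \cdot) \|_{TV} \le 2 \| Q_1(x, \cdot) - Q_2(x, \cdot) \|_{TV}. \]

Indeed, the acceptance probability

\[ \alpha(x, y) = \min\Big\{ 1, \frac{\pi(y)}{\pi(x)} \Big\} \in [0, 1] \]

does not depend on the proposal, and for any $x \in \mathcal{X}$ and $A \in \mathcal{B}(\mathcal{X})$ we compute

\[ |P_1(x, A) - P_2(x, A)|
   \le \Big| \int_A \alpha(x, y) \big( q_1(y) - q_2(y) \big) \, dy \Big|
   + \mathbb{1}_{\{x \in A\}} \Big| \int_{\mathcal{X}} \big( 1 - \alpha(x, y) \big) \big( q_1(y) - q_2(y) \big) \, dy \Big|
   \le 2 \| Q_1(x, \cdot) - Q_2(x, \cdot) \|_{TV}. \]

Condition (ii) implies that there exists $K < \infty$, s.t. $\pi(y)/\pi(x) \le K$ for every
$x, y \in \mathcal{X}$. To conclude that (d) results from (ii) note that

\[ | \min\{a, b\} - \min\{c, d\} | \le |a - c| + |b - d| \tag{22} \]

and recall the acceptance probabilities
$\alpha_i(x, y) = \min\big\{ 1, \frac{\pi(y) q_i(y, x)}{\pi(x) q_i(x, y)} \big\}$. Indeed, for
any $x \in \mathcal{X}$ and $A \in \mathcal{B}(\mathcal{X})$, using (22) we have

\[ |P_1(x, A) - P_2(x, A)|
   \le \Big| \int_A \Big( \min\Big\{ q_1(x, y), \frac{\pi(y)}{\pi(x)} q_1(y, x) \Big\}
       - \min\Big\{ q_2(x, y), \frac{\pi(y)}{\pi(x)} q_2(y, x) \Big\} \Big) \, dy \Big| \]
\[ \qquad + \mathbb{1}_{\{x \in A\}} \Big| \int_{\mathcal{X}} \Big( \big( 1 - \alpha_1(x, y) \big) q_1(x, y)
       - \big( 1 - \alpha_2(x, y) \big) q_2(x, y) \Big) \, dy \Big|
   \le 4 (K + 1) \| Q_1(x, \cdot) - Q_2(x, \cdot) \|_{TV}. \]

The claim follows since a random scan Metropolis-within-Gibbs sampler is
a mixture of Metropolis samplers. □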
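
The bound used for case (i) can be checked numerically on a small finite state space. The following is a minimal sketch, not part of the original argument: the state space size, the random target and the helper names (`symmetric_proposal`, `metropolis_kernel`, `tv_rows`) are all illustrative assumptions, and total variation is taken as half the $\ell^1$ distance.

```python
import numpy as np

def symmetric_proposal(rng, k):
    """Random symmetric stochastic matrix on k states (hypothetical helper)."""
    a = rng.random((k, k))
    w = (a + a.T) / 2.0
    np.fill_diagonal(w, 0.0)
    w /= 2.0 * w.sum(axis=1).max()            # off-diagonal row sums are at most 1/2
    np.fill_diagonal(w, 1.0 - w.sum(axis=1))  # remaining mass stays put
    return w

def metropolis_kernel(q, pi):
    """Metropolis kernel for a symmetric proposal q and target pi."""
    ratio = pi[None, :] / pi[:, None]         # ratio[x, y] = pi(y) / pi(x)
    p = q * np.minimum(1.0, ratio)
    np.fill_diagonal(p, 0.0)
    np.fill_diagonal(p, 1.0 - p.sum(axis=1))  # rejected mass stays at x
    return p

def tv_rows(a, b):
    """Row-wise total variation distance between two kernels."""
    return 0.5 * np.abs(a - b).sum(axis=1)

rng = np.random.default_rng(0)
k = 6
pi = rng.random(k) + 0.1
pi /= pi.sum()
q1, q2 = symmetric_proposal(rng, k), symmetric_proposal(rng, k)
p1, p2 = metropolis_kernel(q1, pi), metropolis_kernel(q2, pi)

lhs = tv_rows(p1, p2)        # ||P_1(x, .) - P_2(x, .)||_TV for each x
rhs = 2.0 * tv_rows(q1, q2)  # 2 ||Q_1(x, .) - Q_2(x, .)||_TV for each x
print(np.all(lhs <= rhs + 1e-12))
```

The comparison is made state by state, mirroring the pointwise inequality in the proof rather than only its supremum over $x$.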
We now provide an example to show that diminishing adaptation of proposals
as in Lemma 6.4 does not necessarily imply condition (d) of Theorem 6.2, so
some additional assumption is required, e.g. (i) or (ii) of Lemma 6.4.

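Example 6.5. Consider a sequence of Metropolis algorithms with transition
kernels $P_1, P_2, \dots$ designed for sampling from $\pi(k) = p^k (1 - p)$ on $\mathcal{X} = \{0, 1, \dots\}$.
The transition kernel $P_n$ results from using proposal kernel $Q_n$ and the standard
acceptance rule, where

\[ Q_n(j, k) = q_n(k) :=
   \begin{cases}
     p^k \big( \frac{1}{1-p} - p^n + p^{2n} \big)^{-1} & \text{for } k \ne n, \\
     p^{2n} \big( \frac{1}{1-p} - p^n + p^{2n} \big)^{-1} & \text{for } k = n.
   \end{cases} \]

Clearly

\[ \sup_{j \in \mathcal{X}} \| Q_{n+1}(j, \cdot) - Q_n(j, \cdot) \|_{TV} = q_{n+1}(n) - q_n(n) \to 0. \]

However

\[ \sup_{j \in \mathcal{X}} \| P_{n+1}(j, \cdot) - P_n(j, \cdot) \|_{TV} \ge P_{n+1}(n, 0) - P_n(n, 0) \]
\[ \qquad = \min\Big\{ q_{n+1}(0), \frac{\pi(0)}{\pi(n)} q_{n+1}(n) \Big\}
          - \min\Big\{ q_n(0), \frac{\pi(0)}{\pi(n)} q_n(n) \Big\} \]
\[ \qquad = q_{n+1}(0) - q_n(0) p^n \to 1 - p \ne 0. \]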
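
The two limits in Example 6.5 can be illustrated numerically. The sketch below is an illustration only: the function names (`q`, `target`, `p_move`, `proposal_gap`, `kernel_gap`) are hypothetical, the choice $p = 0.5$ is arbitrary, and the acceptance rule is the standard one for an independence-type proposal.

```python
def q(n, k, p):
    """Proposal mass q_n(k) from Example 6.5."""
    z = 1.0 / (1.0 - p) - p**n + p**(2 * n)   # normalising constant
    return (p**(2 * n) if k == n else p**k) / z

def target(k, p):
    """Target mass pi(k) = p^k (1 - p)."""
    return p**k * (1.0 - p)

def p_move(n, j, k, p):
    """Metropolis transition probability P_n(j, k) for k != j."""
    accept = min(1.0, target(k, p) * q(n, j, p) / (target(j, p) * q(n, k, p)))
    return q(n, k, p) * accept

def proposal_gap(n, p):
    """q_{n+1}(n) - q_n(n): the change in the proposal at state n."""
    return q(n + 1, n, p) - q(n, n, p)

def kernel_gap(n, p):
    """P_{n+1}(n, 0) - P_n(n, 0): the change in the kernel from n to 0."""
    return p_move(n + 1, n, 0, p) - p_move(n, n, 0, p)

for n in (5, 10, 20, 30):
    print(n, proposal_gap(n, 0.5), kernel_gap(n, 0.5))
```

As $n$ grows the first printed column shrinks towards 0 while the second approaches $1 - p$, in line with the displayed limits.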



7. A specific Metropolis-within-Gibbs adaptive choice 

As an application of the previous section, we discuss a particular method of
adapting the selection probabilities for the doubly-adaptive Metropolis-within-Gibbs
algorithms. We are motivated by two closely-related componentwise adaptation
algorithms, from [18] and from Section 3 of [40]. Briefly, these algorithms
use a deterministic scan Metropolis-within-Gibbs sampler and perform a random
walk Metropolis step for updating coordinate $i$ by proposing a normal increment
to $X_{n-1,i}$, i.e. the proposal $Y_{n,i} \sim N(X_{n-1,i}, \sigma^2_{n,i})$. The proposal variance $\sigma^2_{n,i}$
is subject to adaptation. Haario et al. in [18] use

\[ \sigma^{2,HST}_{n,i} = (2.4)^2 \big( s^2_{n,i} + 0.05 \big), \tag{23} \]

where $s^2_{n,i}$ is the sample variance of $X_{0,i}, \dots, X_{n-1,i}$, whereas Roberts and
Rosenthal in [40] take

\[ \sigma^{2,RR}_{n,i} = e^{ls_i}, \tag{24} \]

and $ls_i$ is updated every batch of 50 iterations by adding or subtracting $\delta(n) =
O(n^{-1/2})$. Specifically, $ls_i$ is increased by $\delta(n)$ if the fraction of acceptances of
variable $i$ was more than 0.44 on the last batch and decreased if it was less.
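
The two variance rules above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the concrete step size `min(0.01, batch_index ** -0.5)` is one assumed choice satisfying $\delta(n) = O(n^{-1/2})$, and the clipping bound `M` anticipates a truncation of $ls_i$ to a bounded interval.

```python
import numpy as np

def hst_variance(history_i):
    """Rule (23): (2.4)^2 times (sample variance of coordinate i plus 0.05)."""
    s2 = np.var(history_i, ddof=1) if len(history_i) > 1 else 0.0
    return 2.4**2 * (s2 + 0.05)

def rr_update(ls_i, acc_rate, batch_index, M=10.0):
    """One batch update of ls_i, driven by the acceptance rate of coordinate i.

    delta is an assumed instance of delta(n) = O(n^{-1/2}); M is an assumed
    bound that keeps ls_i inside [-M, M].
    """
    delta = min(0.01, batch_index ** -0.5)
    if acc_rate > 0.44:
        ls_i += delta
    elif acc_rate < 0.44:
        ls_i -= delta
    return float(np.clip(ls_i, -M, M))

def rr_variance(ls_i):
    """Rule (24): proposal variance as the exponential of ls_i."""
    return float(np.exp(ls_i))

print(hst_variance([1.0, 3.0]))                 # sample variance 2, so 5.76 * 2.05
print(rr_update(0.0, acc_rate=0.5, batch_index=1))
```

In an actual sampler `rr_update` would be called once per batch of 50 iterations for each coordinate, with `acc_rate` computed from that batch only.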
Both rules have theoretical motivation, c.f. [37]: $\sigma^{2,HST}_{n,i}$ comes from
diffusion limit considerations in infinite dimensions and $\sigma^{2,RR}_{n,i}$ is motivated by
one dimensional Gaussian target densities. Conclusions drawn in these very special
situations are observed empirically to be robust in a wide range of examples
that are neither high-dimensional nor Gaussian [37], [40].

In this section, we use a random scan Gibbs sampler instead of a deterministic
scan, and optimise the coordinate selection probabilities $\alpha_i$ simultaneously with
proposal variances. We aim at minimizing the asymptotic variance. Under certain
strong conditions (Assumption 7.1) that allow for illustrative analysis and
explicit calculations, we shall provide approximately optimal adaptions for the
$\alpha_i$ in equations (43) and (44) below, and shall prove ergodicity of the
corresponding algorithms in Theorem 7.3. More general adaptation algorithms for random
scan Gibbs samplers have been investigated by others (e.g. [26], [24], [22], [23]).

Assumption 7.1. The following conditions hold.

(i) The stationary distribution on $\mathcal{X} = \mathbb{R}^d$ is of the product form

\[ \pi(x) = \prod_{i=1}^{d} C_i \, g(C_i x_i), \tag{25} \]

where $g$ is a one dimensional density and $C_i$, $i = 1, \dots, d$, are unknown,
strictly positive constants.
(ii) The second moment of $g$ exists, i.e. $\sigma^2 := \mathrm{Var}_g Z < \infty$.
(iii) The one-dimensional random walk Metropolis algorithm with $N(x, 1)$
proposal distributions and target density $g$ is uniformly ergodic.

We consider an adaptive random scan adaptive random walk Metropolis-within-Gibbs
algorithm AdapRSadapMwG, with Gaussian proposals, for estimating
the expectation of a linear target function

\[ f(x) = a_0 + \sum_{i=1}^{d} a_i x_i. \tag{26} \]

A random scan Gibbs sampler for a target density of product form (25)
is uniformly ergodic, therefore arguing as in the proof of Theorem 6.2, under
Assumption 7.1 a random scan Metropolis-within-Gibbs with $N(x, 1)$ proposals
is uniformly ergodic. Moreover, by (ii), the function $f$ defined in (26) is square
integrable and the Markov chain CLT holds, i.e. for any initial distribution of
$X_0$,

\[ n^{-1/2} \Big( \sum_{i=0}^{n-1} f(X_i) - n E_\pi f(X) \Big) \xrightarrow{d} N(0, \sigma^2_{as}), \quad \text{as } n \to \infty, \tag{27} \]

where the asymptotic variance $\sigma^2_{as} < \infty$ can be written as

\[ \sigma^2_{as} = \tau_f \, \mathrm{Var}_\pi f(X), \quad \text{and} \tag{28} \]

\[ \tau_f = 1 + 2 \sum_{k=1}^{\infty} \mathrm{Cor}_\pi \big( f(X_0), f(X_k) \big) \tag{29} \]
fe=i 
is the stationary integrated autocorrelation time. Markov chain CLTs and
asymptotic variance formulae are discussed e.g. in [551 UHl HI]. Note that under
Assumption 7.1 the asymptotic variance decomposes and some explicit
computations are possible:

\[ \sigma^2_{as} = \sum_{i=1}^{d} \sigma^2_{as,i} = \sum_{i=1}^{d} \tau_{f,i} \, \mathrm{Var}_{\pi,i} f(X), \quad \text{where} \tag{30} \]

\[ \tau_{f,i} = 1 + 2 \sum_{k=1}^{\infty} \mathrm{Cor}_\pi (X_{0,i}, X_{k,i}), \quad \text{and} \tag{31} \]

\[ \mathrm{Var}_{\pi,i} f(X) := \mathrm{Var}_\pi (a_i X_{0,i}) = \frac{\sigma^2}{C_i^2} a_i^2. \tag{32} \]

To compute $\tau_{f,i}$ for a random scan Metropolis-within-Gibbs sampler in the
present setting, we focus solely on coordinate $i$, i.e. the Markov chain $X_{n,i}$,
$n = 0, 1, \dots$ Due to the product form of $\pi$, the distribution of $X_{n,i} \mid X_{n-1}$ does not
depend on $X_{n-1,-i}$. Let $P_i$ be the transition kernel that describes the dynamics
of $X_{n,i}$, $n = 0, 1, \dots$ and let $\alpha = (\alpha_1, \dots, \alpha_d)$ denote the (fixed) selection
probabilities. We write $P_i$ as a mixture

\[ P_i(x_i, \cdot) = (1 - \alpha_i) \, \mathrm{Id} + \alpha_i P_i^{Metrop}(x_i, \cdot), \tag{33} \]

where Id denotes the identity kernel and $P_i^{Metrop}$ performs a single Metropolis
step for the target distribution $C_i g(C_i x)$. Thus $P_i$ is a lazy version of $P_i^{Metrop}$,
since it performs a $P_i^{Metrop}$ step if coordinate $i$ is selected, which happens with
probability $\alpha_i$, and an identity step otherwise. We will use Lemma 7.2 below, which is a general
result about the asymptotic variance of lazy reversible Markov chains. Suppose

\[ h \in L_0^2(\pi) := \{ h \in L^2(\pi) : \pi h = 0 \}, \]

and denote

\[ \sigma^2_{h,H} := \lim_{n \to \infty} \frac{1}{n} \mathrm{Var} \Big( \sum_{i=0}^{n-1} h(Z_i) \Big), \]

where $Z_0, Z_1, \dots$ is a Markov chain with transition kernel $H$ and initial
distribution $\pi$ that is stationary for $H$.

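Lemma 7.2. Let $P$ be a reversible transition kernel with stationary measure $\pi$.
Let $\delta \in (0, 1)$ and by $P_\delta$ denote its lazy version

\[ P_\delta = (1 - \delta) \, \mathrm{Id} + \delta P. \]

Then

\[ \sigma^2_{h,P_\delta} = \frac{1}{\delta} \sigma^2_{h,P} + \frac{1 - \delta}{\delta} \pi h^2. \tag{34} \]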
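Proof. The proof is based on the functional analytic approach (see e.g. [20],
[34]). A reversible transition kernel $P$ with invariant distribution $\pi$ is a
self-adjoint operator on $L_0^2(\pi)$ with spectral radius bounded by 1. By the spectral
decomposition theorem for self-adjoint operators, for each $h \in L_0^2(\pi)$ there exists
a finite positive measure $E_{h,P}$ on $[-1, 1]$, such that

\[ \langle h, P^n h \rangle = \int_{[-1,1]} x^n E_{h,P}(dx), \]

for all integers $n \ge 0$. Thus in particular

\[ \pi h^2 = \int_{[-1,1]} 1 \, E_{h,P}(dx), \tag{35} \]

\[ \sigma^2_{h,P} = \int_{[-1,1]} \frac{1 + x}{1 - x} E_{h,P}(dx). \tag{36} \]

Since

\[ P_\delta^n = \big( (1 - \delta) \, \mathrm{Id} + \delta P \big)^n = \sum_{k=0}^{n} \binom{n}{k} (1 - \delta)^k \delta^{n-k} P^{n-k}, \]

we have

\[ \langle h, P_\delta^n h \rangle = \int_{[-1,1]} \big( (1 - \delta) + \delta x \big)^n E_{h,P}(dx), \]

and consequently

\[ \sigma^2_{h,P_\delta} = \int_{[-1,1]} \frac{1 + 1 - \delta + \delta x}{1 - 1 + \delta - \delta x} E_{h,P}(dx)
   = \int_{[-1,1]} \frac{1}{\delta} \Big( \frac{1 + x}{1 - x} + 1 - \delta \Big) E_{h,P}(dx) \]
\[ \qquad = \frac{1}{\delta} \int_{[-1,1]} \frac{1 + x}{1 - x} E_{h,P}(dx) + \frac{1 - \delta}{\delta} \int_{[-1,1]} 1 \, E_{h,P}(dx)
   = \frac{1}{\delta} \sigma^2_{h,P} + \frac{1 - \delta}{\delta} \pi h^2, \]

as claimed. □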
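
Identity (34) can be sanity-checked on a finite state space, where the asymptotic variance is available in closed form through the Poisson equation. The sketch below is illustrative only: the five-state Metropolis chain, the test function `h` and the helper `asymptotic_variance` are assumptions made for this check and do not appear in the original text.

```python
import numpy as np

def asymptotic_variance(P, pi, h):
    """sigma^2_{h,P} for a finite ergodic chain, via the Poisson equation.

    For centred h, g = sum_k P^k h solves (I - P + 1 pi^T) g = h and
    sigma^2 = 2 <h, g>_pi - <h, h>_pi.
    """
    h = h - pi @ h                                   # centre h under pi
    n = len(pi)
    g = np.linalg.solve(np.eye(n) - P + np.outer(np.ones(n), pi), h)
    return 2.0 * np.sum(pi * h * g) - np.sum(pi * h * h)

rng = np.random.default_rng(0)
k = 5
pi = rng.random(k) + 0.1
pi /= pi.sum()

# Metropolis kernel with uniform proposals: reversible with respect to pi.
P = np.minimum(1.0, pi[None, :] / pi[:, None]) / k
np.fill_diagonal(P, 0.0)
np.fill_diagonal(P, 1.0 - P.sum(axis=1))

h = rng.normal(size=k)
h0 = h - pi @ h
delta = 0.3
P_lazy = (1.0 - delta) * np.eye(k) + delta * P       # lazy version P_delta

lhs = asymptotic_variance(P_lazy, pi, h)
rhs = asymptotic_variance(P, pi, h) / delta + (1.0 - delta) / delta * np.sum(pi * h0**2)
print(lhs, rhs)
```

The two printed numbers agree up to floating point error, since $I - P_\delta = \delta (I - P)$ on centred functions.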
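Let

\[ \tilde\sigma^2_{as,i} = \tilde\tau_{f,i} \, \mathrm{Var}_\pi (a_i X_{0,i}) = \tilde\tau_{f,i} \frac{\sigma^2}{C_i^2} a_i^2 \tag{37} \]

be the asymptotic variance of the Metropolis kernel $P_i^{Metrop}$ defined in (33).
Here $\tilde\tau_{f,i}$ is its stationary integrated autocorrelation time. From Lemma 7.2 we
have the following formula for $\sigma^2_{as,i}$ of (30):

\[ \sigma^2_{as,i} = \frac{1}{\alpha_i} \tilde\sigma^2_{as,i} + \frac{1 - \alpha_i}{\alpha_i} \frac{\sigma^2}{C_i^2} a_i^2, \quad \text{hence} \]

\[ \tau_{f,i} = \frac{1}{\alpha_i} \tilde\tau_{f,i} + \frac{1 - \alpha_i}{\alpha_i}. \tag{38} \]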
Now we take advantage of the fact that $f$ is linear and of the actual adaptation
of the proposal variances performed by both versions, i.e. HST and RR. Namely,
they aim at minimizing their integrated autocorrelation time $\tilde\tau_{f,i}$. Under
Assumption 7.1 the conditional distributions are equally shaped up to the scaling
constant $C_i$. However the adaptive algorithm will learn $C_i$ and adjust the
proposal variance accordingly. We conclude that after an initial learning period the
following proportionality relation will hold approximately

\[ \sigma^{2,HST}_{n,i} \propto \sigma^{2,RR}_{n,i} \propto C_i^{-2}, \]

and also the stationary integrated autocorrelation times for the adapted $P_i^{Metrop}$
will be close to the (unknown) optimal value, say $T$, i.e.

\[ \tilde\tau_{f,i} \approx T. \tag{39} \]

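Typically $T \gg 1$, hence we can approximately write (using (28), (30), (31), (32)
and (38))

\[ \sigma^2_{as,i} \approx \frac{T \sigma^2}{\alpha_i C_i^2} a_i^2, \quad \text{and finally} \tag{40} \]

\[ \sigma^2_{as} \approx T \sigma^2 \sum_{i=1}^{d} \frac{a_i^2}{\alpha_i C_i^2}
   \propto \sum_{i=1}^{d} \frac{\sigma^{2,HST}_{n,i} a_i^2}{\alpha_i}
   \propto \sum_{i=1}^{d} \frac{\sigma^{2,RR}_{n,i} a_i^2}{\alpha_i}. \tag{41} \]

The last expression is minimised for

\[ \alpha_i \propto \big( \sigma^{2,HST}_{n,i} a_i^2 \big)^{1/2} \propto \big( \sigma^{2,RR}_{n,i} a_i^2 \big)^{1/2}, \tag{42} \]

which yields a very intuitive prescription for adapting selection probabilities,
namely by setting

\[ \alpha^{HST}_{n,i} := \frac{\big( \sigma^{2,HST}_{n,i} a_i^2 \big)^{1/2}}{\sum_{k=1}^{d} \big( \sigma^{2,HST}_{n,k} a_k^2 \big)^{1/2}} \quad \text{for the HST version of [18], and} \tag{43} \]

\[ \alpha^{RR}_{n,i} := \frac{\big( \sigma^{2,RR}_{n,i} a_i^2 \big)^{1/2}}{\sum_{k=1}^{d} \big( \sigma^{2,RR}_{n,k} a_k^2 \big)^{1/2}} \quad \text{for the RR version of [40].} \tag{44} \]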



The above argument shows that (43) and (44) are approximately optimal choices of
adaptive selection probabilities for these algorithms, at least for target densities
of the form (25).
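
The prescription for the selection probabilities is a one-line normalisation, and its optimality for the variance proxy $\sum_i \sigma^2_{n,i} a_i^2 / \alpha_i$ can be checked against random alternatives. The following is a minimal sketch with hypothetical function names and made-up input values; it assumes all $a_i \ne 0$, so that every coordinate keeps a positive selection probability.

```python
import numpy as np

def selection_probabilities(sigma2, a):
    """alpha_i proportional to sqrt(sigma2_i * a_i^2), normalised to sum to 1."""
    w = np.sqrt(np.asarray(sigma2, float) * np.asarray(a, float) ** 2)
    return w / w.sum()

def variance_proxy(alpha, sigma2, a):
    """The quantity sum_i sigma2_i a_i^2 / alpha_i that the weights should minimise."""
    sigma2, a = np.asarray(sigma2, float), np.asarray(a, float)
    return float(np.sum(sigma2 * a**2 / np.asarray(alpha, float)))

sigma2 = [1.0, 4.0, 0.25]        # illustrative adapted proposal variances
a = [1.0, 1.0, 2.0]              # illustrative coefficients of the linear functional
alpha_star = selection_probabilities(sigma2, a)

rng = np.random.default_rng(0)
others = rng.dirichlet(np.ones(3), size=1000)   # random competing weightings
best_other = min(variance_proxy(alt, sigma2, a) for alt in others)
print(alpha_star, variance_proxy(alpha_star, sigma2, a), best_other)
```

By the Cauchy-Schwarz inequality no probability vector can give a smaller proxy than `alpha_star`, so `best_other` is never below the value attained at `alpha_star`.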

We next prove ergodicity of these algorithms. Let HST-algorithm denote
an AdapRSadapMwG that uses (23) for updating proposal variances and (43) for
updating selection probabilities. Similarly let RR-algorithm follow (24) and (44)
with the additional restriction for $ls_i$ to stay in $[-M, M]$ for some fixed, large $M <
\infty$ (which technically plays the role of 0.05 in (23) for the HST-algorithm).
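
To make the HST-algorithm concrete, here is a rough sketch of one possible implementation. It is an illustration under assumptions rather than the authors' code: the function name `hst_algorithm`, the starting point at the origin, the running-variance bookkeeping and the example density `log_g` (a standard normal restricted to $[-5, 5]$, chosen so that the support is bounded) are all hypothetical choices.

```python
import numpy as np

def hst_algorithm(log_g, C, a, n_iter, rng):
    """Sketch of a doubly adaptive random scan Metropolis-within-Gibbs run.

    Target: product of C_i g(C_i x_i). Proposal variances follow rule (23),
    selection probabilities follow rule (43).
    """
    C, a = np.asarray(C, float), np.asarray(a, float)
    d = len(C)
    x = np.zeros(d)
    count, mean, m2 = 1, x.copy(), np.zeros(d)   # running statistics (Welford)
    chain = np.empty((n_iter, d))
    alpha = np.full(d, 1.0 / d)
    for n in range(n_iter):
        s2 = m2 / (count - 1) if count > 1 else np.zeros(d)
        sigma2 = 2.4**2 * (s2 + 0.05)            # proposal variances
        w = np.sqrt(sigma2 * a**2)
        alpha = w / w.sum()                      # selection probabilities
        i = rng.choice(d, p=alpha)               # random scan: pick a coordinate
        y = x[i] + rng.normal(0.0, np.sqrt(sigma2[i]))
        # Only the i-th factor of the product target changes.
        log_ratio = log_g(C[i] * y) - log_g(C[i] * x[i])
        if np.log(rng.random()) < log_ratio:
            x[i] = y
        count += 1                               # update running mean and variance
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
        chain[n] = x
    return chain, alpha

def log_g(z):
    """Assumed example density: standard normal restricted to [-5, 5]."""
    return -0.5 * z * z if abs(z) <= 5.0 else -np.inf

rng = np.random.default_rng(1)
chain, alpha = hst_algorithm(log_g, C=[1.0, 10.0], a=[1.0, 1.0], n_iter=5000, rng=rng)
print(alpha)
```

With these inputs the first coordinate has the wider conditional distribution, so its selection probability should end up larger than that of the second coordinate.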
Theorem 7.3. Under Assumption 7.1 the HST- and RR-algorithms are ergodic.



Proof. It is enough to check that the assumptions of Theorem 6.2 are satisfied.
We do this for the HST-algorithm; the proof for the RR-algorithm follows in
the same way. Condition (b) is immediately implied by Assumption 7.1 (i),
since (b) requires only that the full Gibbs sampler is uniformly ergodic, which
is obvious for a product target density of the form (25). Next, observe that
Assumption 7.1 (iii) implies that the support of $C_i g(C_i x)$, say $S_{g,i}$, is bounded,
therefore the sample variance estimate in (23) is bounded from above and for
the HST-algorithm, for every $i \in \{1, \dots, d\}$,

\[ \sigma^{2,HST}_{n,i} \in [(2.4)^2 \, 0.05, K_i] =: S_{\sigma,i} \tag{45} \]

for some $K_i < \infty$. Thus (a) holds since the denominator in (43) is bounded from
below and the change in sample variance

\[ \big| \sigma^{2,HST}_{n,i} - \sigma^{2,HST}_{n+1,i} \big| = O(n^{-1}). \tag{46} \]

Condition (d) results from (46), (45) and Lemma 6.4 (i). We are left with (c).
Let $\phi_\sigma(\cdot)$ denote the density function of $N(0, \sigma^2)$. Since

\[ \sup_{i \in \{1, \dots, d\}; \; x, y \in S_{g,i}; \; \sigma_1, \sigma_2 \in S_{\sigma,i}} \frac{\phi_{\sigma_1}(x - y)}{\phi_{\sigma_2}(x - y)} < \infty, \]

the Radon-Nikodym derivative of all pairs of proposals for every coordinate is
bounded and hence Assumption 6.1 is implied again by Assumption 7.1 (iii). □

Remark 7.4. 1. Condition (i) of Assumption 7.1 is very restrictive, however
it already proved extremely helpful in understanding high dimensional
MCMC algorithms via diffusion limits [32], [37], [9], and conclusions drawn
under (i) are empirically observed to be robust even if the condition is
violated. It is essential to investigate its robustness also in the Gibbs sampler
setting.

2. Minor generalisations to (i) are straightforward, e.g. our conclusions hold
for $\mathcal{X} = \prod_{i=1}^{d} \mathcal{X}_i$, where $\mathcal{X}_i = \mathbb{R}^k$.

3. Condition (iii) of Assumption 7.1 is required to ensure asymptotic validity
of our algorithm by Theorem 6.2. We will report separately on ergodicity
of adaptive random scan Gibbs samplers in the non-uniform case.



8. Proof of Proposition 3.2

The analysis of Example 3.1 is somewhat delicate since the process is both time
and space inhomogeneous (as are most nontrivial adaptive MCMC algorithms).
To establish Proposition 3.2 we will define a couple of auxiliary stochastic
processes. Consider the following one dimensional process $(\tilde X_n)_{n \ge 0}$ obtained from
$(X_n)_{n \ge 0}$ by

\[ \tilde X_n := X_{n,1} + X_{n,2} - 2. \]

Clearly $\tilde X_n - \tilde X_{n-1} \in \{-1, 0, 1\}$, moreover $X_{n,1} \to \infty$ and $X_{n,2} \to \infty$ if and
only if $\tilde X_n \to \infty$. Note that the dynamics of $(\tilde X_n)_{n \ge 0}$ are also both time and
space inhomogeneous.

We will also use an auxiliary random-walk-like space homogeneous process

\[ S_0 = 0 \quad \text{and} \quad S_n := \sum_{i=1}^{n} Y_i, \quad \text{for } n \ge 1, \]

where $Y_1, Y_2, \dots$ are independent random variables taking values in $\{-1, 0, 1\}$.
Let the distribution of $Y_n$ on $\{-1, 0, 1\}$ be

\[ \nu_n := \Big( \frac{1}{4} - \frac{1}{a_n}, \; \frac{1}{2}, \; \frac{1}{4} + \frac{1}{a_n} \Big). \tag{47} \]



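We shall couple $(\tilde X_n)_{n \ge 0}$ with $(S_n)_{n \ge 0}$, i.e. define them on the same
probability space $\{\Omega, \mathcal{F}, \mathbb{P}\}$, by specifying the joint distribution of $(\tilde X_n, S_n)_{n \ge 0}$ so
that the marginal distributions remain unchanged. We describe the details of
the construction later. Now define

\[ \Omega_{\tilde X \ge S} := \{ \omega \in \Omega : \tilde X_n(\omega) \ge S_n(\omega) \text{ for every } n \} \tag{48} \]

and

\[ \Omega_\infty := \{ \omega \in \Omega : S_n(\omega) \to \infty \}. \tag{49} \]

Clearly, if $\omega \in \Omega_{\tilde X \ge S} \cap \Omega_\infty$, then $\tilde X_n(\omega) \to \infty$. In the sequel we show that for
our coupling construction

\[ \mathbb{P}(\Omega_{\tilde X \ge S} \cap \Omega_\infty) > 0. \tag{50} \]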

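We shall use Hoeffding's inequality for $S_k^{k+n} := S_{k+n} - S_k$. Since $Y_n \in
[-1, 1]$, it yields for every $t > 0$,

\[ \mathbb{P}\big( S_k^{k+n} - \mathbb{E} S_k^{k+n} \le -nt \big) \le \exp\{ -\tfrac{1}{2} n t^2 \}. \tag{51} \]

Note that $\mathbb{E} Y_n = 2/a_n$ and thus $\mathbb{E} S_k^{k+n} = 2 \sum_{i=k+1}^{k+n} 1/a_i$. The following choice
for the sequence $a_n$ will facilitate further calculations. Let

\[ b_0 = 0, \]
\[ b_1 = 1000, \]
\[ b_n = b_{n-1} \Big( 1 + \frac{1}{10 + \log(n)} \Big), \quad \text{for } n \ge 2, \]
\[ c_n = \sum_{i=0}^{n} b_i, \]
\[ a_n = 10 + \log(k), \quad \text{for } c_{k-1} < n \le c_k. \]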
Remark 8.1. To keep notation reasonable we ignore the fact that $b_n$ will not be
an integer. It should be clear that this does not affect proofs, as the constants
we have defined, i.e. $b_1$ and $a_1$, are bigger than required.

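Lemma 8.2. Let $Y_n$ and $S_n$ be as defined above and let

\[ \Omega_1 := \{ \omega \in \Omega : S_k = k \text{ for every } 0 < k \le c_1 \}, \tag{52} \]

\[ \Omega_n := \Big\{ \omega \in \Omega : S_k \ge \frac{b_{n-1}}{2} \text{ for every } c_{n-1} < k \le c_n \Big\} \quad \text{for } n \ge 2. \tag{53} \]

Then

\[ \mathbb{P}\Big( \bigcap_{n=1}^{\infty} \Omega_n \Big) > 0. \tag{54} \]

Remark 8.3. Note that $b_n \nearrow \infty$ and therefore $\bigcap_{n=1}^{\infty} \Omega_n \subset \Omega_\infty$.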

Proof. With positive probability, say $p_{1,S}$, we have $Y_1 = \dots = Y_{1000} = 1$, which
gives $S_{c_1} = 1000 = b_1$. Hence $\mathbb{P}(\Omega_1) = p_{1,S} > 0$. Moreover recall that $S_{c_{n-1}}^{c_n}$ is a
sum of $b_n$ i.i.d. random variables with $\mathbb{E} S_{c_{n-1}}^{c_n} = \frac{2 b_n}{10 + \log(n)}$. Therefore for every
$n > 1$ by Hoeffding's inequality with $t = 1/(10 + \log(n))$, we can also write

\[ \mathbb{P}\Big( S_{c_{n-1}}^{c_n} \le \frac{b_n}{10 + \log(n)} \Big) \le \exp\Big\{ -\frac{b_n}{2 (10 + \log(n))^2} \Big\} =: p_n. \]

Therefore using the above bound iteratively we obtain

\[ \mathbb{P}\big( S_{c_1} = b_1, \; S_{c_n} \ge b_n \text{ for every } n \ge 2 \big) \ge p_{1,S} \prod_{n=2}^{\infty} (1 - p_n). \tag{55} \]

Now consider the minimum of $S_k$ for $c_{n-1} < k \le c_n$ and $n \ge 2$. The worst case
is when the process $S_k$ goes monotonically down and then monotonically up for
$c_{n-1} < k \le c_n$. By the choice of $b_n$, equation (55) implies also

\[ \mathbb{P}\Big( \bigcap_{n=1}^{\infty} \Omega_n \Big) \ge p_{1,S} \prod_{n=2}^{\infty} (1 - p_n). \tag{56} \]

Clearly in this case

\[ p_{1,S} \prod_{n=2}^{\infty} (1 - p_n) > 0 \iff \sum_{n=2}^{\infty} \log(1 - p_n) > -\infty \iff \sum_{n=2}^{\infty} p_n < \infty. \tag{57} \]

We conclude (57) by comparing $p_n$ with $1/n^2$. We show that there exists $n_0$
such that for $n \ge n_0$ the sequence $p_n$ decreases faster than the sequence $1/n^2$ and
therefore $p_n$ is summable. We check that

\[ \log \frac{p_{n-1}}{p_n} > \log \frac{n^2}{(n-1)^2} \quad \text{for } n \ge n_0. \tag{58} \]
Pn (n-iy 
Indeed

\[ \log \frac{p_{n-1}}{p_n}
   = \frac{1}{2} \Big( -\frac{b_{n-1}}{(10 + \log(n-1))^2} + \frac{b_n}{(10 + \log(n))^2} \Big)
   = \frac{b_{n-1}}{2} \Big( \frac{11 + \log(n)}{(10 + \log(n))^3} - \frac{1}{(10 + \log(n-1))^2} \Big) \]
\[ \qquad = \frac{b_{n-1}}{2} \cdot \frac{(11 + \log(n)) (10 + \log(n-1))^2 - (10 + \log(n))^3}{(10 + \log(n))^3 (10 + \log(n-1))^2}. \]

Now recall that $b_{n-1}$ is an increasing sequence. Moreover the numerator can
be rewritten as

\[ (10 + \log(n)) \big( (10 + \log(n-1))^2 - (10 + \log(n))^2 \big) + (10 + \log(n-1))^2; \]

now use $a^2 - b^2 = (a + b)(a - b)$ to identify the leading term $(10 + \log(n-1))^2$.
Consequently there exists a constant $C$ and $n_0 \in \mathbb{N}$ s.t. for $n \ge n_0$

\[ \log \frac{p_{n-1}}{p_n} \ge \frac{C}{(10 + \log(n))^3} > \frac{2}{n-1} > \log \frac{n^2}{(n-1)^2}. \]

Hence $\sum_{n=2}^{\infty} p_n < \infty$ follows. □
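
The behaviour of the sequences $b_n$ and $p_n$ can be inspected numerically. The sketch below is only an illustration of the summability argument: it evaluates $\log p_n = -b_n / (2 (10 + \log n)^2)$ directly to avoid floating point underflow, truncates at an arbitrary `n_max = 200`, and reports the indices at which the comparison with $\log(n^2/(n-1)^2)$ already holds. The function and variable names are hypothetical.

```python
import math

def sequences(n_max):
    """Return b_0..b_{n_max} and log p_1..log p_{n_max} (index 0 of log_p unused)."""
    b = [0.0, 1000.0]
    for n in range(2, n_max + 1):
        b.append(b[-1] * (1.0 + 1.0 / (10.0 + math.log(n))))
    log_p = [None] + [
        -b[n] / (2.0 * (10.0 + math.log(n)) ** 2) for n in range(1, n_max + 1)
    ]
    return b, log_p

n_max = 200
b, log_p = sequences(n_max)

# Partial sum of p_n and partial product of (1 - p_n), both starting at n = 2.
p_sum = sum(math.exp(lp) for lp in log_p[2:])
log_product = sum(math.log1p(-math.exp(lp)) for lp in log_p[2:])

# Indices n where log(p_{n-1} / p_n) > log(n^2 / (n-1)^2) holds.
holds = [
    n for n in range(3, n_max + 1)
    if log_p[n - 1] - log_p[n] > 2.0 * math.log(n / (n - 1))
]
print(p_sum, math.exp(log_product), min(holds))
```

Because $b_n$ grows faster than any polynomial while $(10 + \log n)^2$ grows slowly, the terms $p_n$ collapse quickly and the partial product stays well away from zero.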

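Now we will describe the coupling construction of $(\tilde X_n)_{n \ge 0}$ and $(S_n)_{n \ge 0}$. We
already remarked that $\bigcap_{n=1}^{\infty} \Omega_n \subset \Omega_\infty$. We will define a coupling that implies
also

\[ \mathbb{P}\Big( \Big( \bigcap_{n=1}^{\infty} \Omega_n \Big) \cap \Omega_{\tilde X \ge S} \Big) \ge C \, \mathbb{P}\Big( \bigcap_{n=1}^{\infty} \Omega_n \Big) \quad \text{for some universal } C > 0, \tag{59} \]

and therefore

\[ \mathbb{P}(\Omega_{\tilde X \ge S} \cap \Omega_\infty) > 0. \tag{60} \]

Thus nonergodicity of $(X_n)_{n \ge 0}$ will follow from Lemma 8.2. We start with the
following observation.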

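Lemma 8.4. There exists a coupling of $\tilde X_n - \tilde X_{n-1}$ and $Y_n$, such that

(a) For every $n \ge 1$ and every value of $\tilde X_{n-1}$

\[ \mathbb{P}(\tilde X_n - \tilde X_{n-1} = 1, \; Y_n = 1) \ge \mathbb{P}(\tilde X_n - \tilde X_{n-1} = 1) \, \mathbb{P}(Y_n = 1), \tag{61} \]

(b) Write even or odd $\tilde X_{n-1}$ as $\tilde X_{n-1} = 2i - 2$ or $\tilde X_{n-1} = 2i - 3$ respectively.
If $2i - 8 \ge a_n$ then the following implications hold a.s.

\[ Y_n = 1 \implies \tilde X_n - \tilde X_{n-1} = 1, \tag{62} \]
\[ \tilde X_n - \tilde X_{n-1} = -1 \implies Y_n = -1. \tag{63} \]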
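Proof. Property (a) is a simple fact for any two $\{-1, 0, 1\}$ valued random
variables $Z$ and $Z'$ with distributions say $\{d_1, d_2, d_3\}$ and $\{d_1', d_2', d_3'\}$. Assign
$\mathbb{P}(Z = Z' = 1) := \min\{d_3, d_3'\}$ and (a) follows. To establish (b) we analyse the
dynamics of $(X_n)_{n \ge 0}$ and consequently of $(\tilde X_n)_{n \ge 0}$. Recall Algorithm 2.2 and
the update rule for $\alpha_n$ in (3). Given $X_{n-1} = (i, i)$, the algorithm will obtain
the value of $\alpha_n$ in step 1, next draw a coordinate according to $(\alpha_{n,1}, \alpha_{n,2})$ in
step 2. In steps 3 and 4 it will move according to conditional distributions for
updating the first or the second coordinate. These distributions are

\[ (1/2, \; 1/2) \quad \text{and} \quad \Big( \frac{i^2}{i^2 + (i-1)^2}, \; \frac{(i-1)^2}{i^2 + (i-1)^2} \Big) \]

respectively. Hence given $X_{n-1} = (i, i)$ the distribution of
$X_n \in \{(i, i-1), (i, i), (i+1, i)\}$ is

\[ \Big( \Big( \frac{1}{2} - \frac{4}{a_n} \Big) \frac{i^2}{i^2 + (i-1)^2}, \;\;
   1 - \Big( \frac{1}{2} - \frac{4}{a_n} \Big) \frac{i^2}{i^2 + (i-1)^2} - \Big( \frac{1}{4} + \frac{2}{a_n} \Big), \;\;
   \frac{1}{4} + \frac{2}{a_n} \Big), \tag{64} \]

whereas if $X_{n-1} = (i, i-1)$ then $X_n \in \{(i-1, i-1), (i, i-1), (i, i)\}$ with
probabilities

\[ \Big( \frac{1}{4} - \frac{2}{a_n}, \;\;
   1 - \Big( \frac{1}{4} - \frac{2}{a_n} \Big) - \Big( \frac{1}{2} + \frac{4}{a_n} \Big) \frac{(i-1)^2}{i^2 + (i-1)^2}, \;\;
   \Big( \frac{1}{2} + \frac{4}{a_n} \Big) \frac{(i-1)^2}{i^2 + (i-1)^2} \Big) \tag{65} \]

respectively. We can conclude the evolution of $(\tilde X_n)_{n \ge 0}$. Namely, if $\tilde X_{n-1} =
2i - 2$ then the distribution of $\tilde X_n - \tilde X_{n-1} \in \{-1, 0, 1\}$ is given by (64) and if
$\tilde X_{n-1} = 2i - 3$ then the distribution of $\tilde X_n - \tilde X_{n-1} \in \{-1, 0, 1\}$ is given by (65).
Let $\le_{st}$ denote stochastic ordering. By simple algebra both measures defined in
(64) and (65) are stochastically bigger than

\[ \mu_n^i = (\mu_{n,1}^i, \mu_{n,2}^i, \mu_{n,3}^i), \tag{66} \]

where

\[ \mu_{n,1}^i := \Big( \frac{1}{2} - \frac{4}{a_n} \Big) \Big( \frac{1}{2} + \frac{1}{i} \Big)
   = \frac{1}{4} - \frac{1}{a_n} - \frac{2i + 8 - a_n}{2 i a_n}, \tag{67} \]

\[ \mu_{n,2}^i := 1 - \Big( \frac{1}{2} - \frac{4}{a_n} \Big) \Big( \frac{1}{2} + \frac{1}{i} \Big)
   - \Big( \frac{1}{2} + \frac{4}{a_n} \Big) \Big( \frac{1}{2} - \frac{1}{\max\{4, i\}} \Big), \]

\[ \mu_{n,3}^i := \Big( \frac{1}{2} + \frac{4}{a_n} \Big) \Big( \frac{1}{2} - \frac{1}{\max\{4, i\}} \Big)
   = \frac{1}{4} + \frac{1}{a_n} + \frac{2 \max\{4, i\} - 8 - a_n}{2 a_n \max\{4, i\}}. \tag{68} \]

Recall $\nu_n$, the distribution of $Y_n$ defined in (47). Examine (67) and (68) to see
that if $2i - 8 \ge a_n$, then $\mu_n^i \ge_{st} \nu_n$. Hence in this case also the distribution
of $\tilde X_n - \tilde X_{n-1}$ is stochastically bigger than the distribution of $Y_n$. The joint
probability distribution of $(\tilde X_n - \tilde X_{n-1}, Y_n)$ satisfying (62) and (63) follows. □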
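
The chain of stochastic comparisons in this proof can be checked numerically on a grid. The sketch below rests on assumptions: it takes the adaptation offset of the selection probabilities to be $4/a_n$, uses the product forms $(1/2 \mp 4/a_n)(1/2 \pm 1/i)$ for the outer masses of the reference measure, and only scans pairs $(i, a_n)$ with $a_n \ge 10$ and $2i - 8 \ge a_n$. All function names are hypothetical.

```python
def nu(a):
    """Law of Y_n on (-1, 0, 1)."""
    return (0.25 - 1.0 / a, 0.5, 0.25 + 1.0 / a)

def mu(i, a):
    """Reference measure on (-1, 0, 1) in the assumed product form."""
    down = (0.5 - 4.0 / a) * (0.5 + 1.0 / i)
    up = (0.5 + 4.0 / a) * (0.5 - 1.0 / max(4, i))
    return (down, 1.0 - down - up, up)

def step_from_diagonal(i, a):
    """Increment law of the one dimensional process when X_{n-1} = (i, i)."""
    down = (0.5 - 4.0 / a) * i * i / (i * i + (i - 1) ** 2)
    up = 0.25 + 2.0 / a
    return (down, 1.0 - down - up, up)

def step_from_offdiagonal(i, a):
    """Increment law of the one dimensional process when X_{n-1} = (i, i-1)."""
    down = 0.25 - 2.0 / a
    up = (0.5 + 4.0 / a) * (i - 1) ** 2 / (i * i + (i - 1) ** 2)
    return (down, 1.0 - down - up, up)

def dominates(p, q, tol=1e-12):
    """p >=_st q on {-1, 0, 1}: less mass at -1 and more mass at +1."""
    return p[0] <= q[0] + tol and p[2] >= q[2] - tol

all_ok = True
for i in range(9, 61):
    for t in range(21):
        a = 10.0 + t * (2 * i - 8 - 10.0) / 20.0   # a ranges over [10, 2i - 8]
        m = mu(i, a)
        all_ok &= dominates(step_from_diagonal(i, a), m)
        all_ok &= dominates(step_from_offdiagonal(i, a), m)
        all_ok &= dominates(m, nu(a))
print(all_ok)
```

On three-point laws supported on $\{-1, 0, 1\}$ stochastic ordering reduces to the two comparisons in `dominates`, which is why no general coupling routine is needed for the check.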



Proof of Proposition 3.2. Define

\[ \Omega_{1,\tilde X} := \{ \omega \in \Omega : \tilde X_n - \tilde X_{n-1} = 1 \text{ for every } 0 < n \le c_1 \}. \tag{69} \]

Since the distribution of $\tilde X_n - \tilde X_{n-1}$ is stochastically bigger than $\mu_n^i$ defined in
(66) and $\mu_n^i(1) > c > 0$ for every $i$ and $n$, we have

\[ \mathbb{P}(\Omega_{1,\tilde X}) =: p_{1,X} > 0. \]

By Lemma 8.4 (a) we have

\[ \mathbb{P}(\Omega_{1,\tilde X} \cap \Omega_1) \ge p_{1,S} \, p_{1,X} > 0. \tag{70} \]

Since $S_{c_1} = \tilde X_{c_1} = c_1 = b_1$ on $\Omega_{1,\tilde X} \cap \Omega_1$, the requirements for Lemma 8.4 (b)
hold for $n - 1 = c_1$. We shall use Lemma 8.4 (b) iteratively to keep $\tilde X_n \ge S_n$
for every $n$. Recall that we write $\tilde X_{n-1}$ as $\tilde X_{n-1} = 2i - 2$ or $\tilde X_{n-1} = 2i - 3$. If
$2i - 8 \ge a_n$ and $\tilde X_{n-1} \ge S_{n-1}$ then by Lemma 8.4 (b) also $\tilde X_n \ge S_n$. Clearly if
$\tilde X_k \ge S_k$ and $S_k \ge \frac{b_{n-1}}{2}$ for $c_{n-1} < k \le c_n$ then $\tilde X_k \ge \frac{b_{n-1}}{2}$ for $c_{n-1} < k \le c_n$,
hence

\[ 2i - 2 \ge \frac{b_{n-1}}{2} \quad \text{for } c_{n-1} < k \le c_n. \]

This in turn gives $2i - 8 \ge \frac{b_{n-1}}{2} - 6$ for $c_{n-1} < k \le c_n$ and since $a_k = 10 + \log(n)$,
for the iterative construction to hold, we need $b_n \ge 32 + 2 \log(n + 1)$. By the
definition of $b_n$ and standard algebra we have

\[ b_n \ge 1000 \Big( 1 + \sum_{i=2}^{n} \frac{1}{10 + \log(i)} \Big) \ge 32 + 2 \log(n + 1) \quad \text{for every } n \ge 1. \]

Summarising the above argument provides

\[ \mathbb{P}(\Omega_{\tilde X \ge S} \cap \Omega_\infty) \ge p_{1,X} \, p_{1,S} \prod_{n=2}^{\infty} (1 - p_n) > 0. \]

Hence $(X_n)_{n \ge 0}$ is not ergodic, and in particular $\| \pi_n - \pi \|_{TV} \nrightarrow 0$. □